Description

Description

Table of Contents

 

Introduction

Graphs are a powerful data structure that can be applied to several problems in bioinformatics. For instance, a lot of biological data can be represented as graphs: molecule structures, protein secondary structures, protein interaction networks. Since all these data are store in huge biological data banks, one of the most important challenges for bioinformatics is the development of efficient Pattern Recognition tools to retrieve relevant information. Among these, the search for patterns inside a biological database can be formulated as a graph matching problem.

Graph matching within a biological database is a very computationally-intensive problem, especially because of the size of the graphs involved. For instance, if we consider a generic protein database, while a pattern can be a small graph of ten nodes, a complete protein graph, where the pattern is to be searched, can have thousands of nodes and edges. So the number of candidate solutions for all the proteins can be huge.

To reduce the computational complexity, which in the worst case is exponential, some researchers have put their efforts in the development of more or less complex heuristics, that can decrease the matching time in the average case, at least for some classes of graphs, but still retain the exponential worst case; others have devoted their attention to inexact, suboptimal matching algorithms, that achieve a polynomial time, but do not ensure the accuracy of the found solutions, which may or may not be acceptable depending on the application. Besides the time complexity, space requirements of the algorithms also are an important concern when working with the large graphs of bioinformatics applications.

The aim of this contest is to assess the state-of-the-art about graph matching algorithms suitable for bioinformatics graphs, using both exact and inexact approaches. The participants will be evaluated considering the time and space requirements of the algorithms, and, for the inexact ones, the accuracy of the found solutions.

Go back to the top

Competition Task

The competition is about the application of graph matching algorithms for solving the problem of finding a pattern structure in a graph database containing molecules, protein structures and interaction networks.

The contest organizers will provide the participants with the entire original dataset needed to perform the task.

During the experimental phase of the competition, the graphs, both target and pattern, contained in the dataset will have their nodes randomly permutated.

A competitor will propose a graph matching algorithm, either exact or inexact, to find a pattern graph (subgraph) inside a target one. Since the solution can be both exact and approximate, each algorithm will be judged using three kinds of indexes: precision, time and memory requirements.

Go back to the top

Database

A new graph database have been built to perform the task proposed for the challenge. This database contains a set of large graphs extracted from biological structures. The graphs in database have been grouped in the following categories:

Molecules (small, sparse, directed)
  • Target Graphs: 10000
    • Size: from 8 to 99 nodes
    • Avg Degree: 2 edges per nodes
  • Query Graphs: 50 (10 for size)
    • Size: 4,8,16,32,64 nodes
  • Label Diversity: 4-5 labels
Proteins and Backbones (large, sparse, directed)
  • Target Graphs: 300
    • Size: from 535 to 10081 nodes
    • Avg Degree: 2 edges per node
  • Query Graphs: 60 (10 for size)
    • Size: 8,16,32,64,128,256 nodes
  • Label Diversity: 4-5 labels
Protein Contact Maps (medium, dense, directed)
  • Target Graphs: 300
    • Size: from 99 to 733 nodes
    • Avg Degree: 20 edges per node
  • Query Graphs: 60 (10 for size)
    • Size: 8,16,32,64,128,256 nodes
  • Label Diversity: 21 labels

For each category, a set of smaller subgraphs, extracted from the graphs, have also been provided. Each subgraph have been used as a pattern to be found within a target graph of the same category. The generated patterns have been grouped by density and size.

The dataset for the competition has been realized using data from different biological databases:

  • Protein Data Bank (PDB): PDB is a key resource in proteomics and other areas of structural biology. PDB is a repository for the 3-D structural data of large biological molecules, such as protein and nucleic acids.
  • National Center for Biotechnology Information (NCBI): NCBI houses a series of databases relevant to biotechnology and biomedicine, with particular reference to genomics.

Go back to the top

 

Metrics

The indices used to evaluate the algorithms participating to the contest will be described in the following subsections.

Accuracy

Accuracy measures are of interest mostly for inexact algorithms; an exact algorithm should always provide an accurate solution, unless there is a bug in its implementation. Two kinds of accuracy measures will be considered:

  • Number of solutions: The aim of this metric is to evaluate the ability of the algorithm to retrieve all the target graphs that contain a pattern. Given a pattern and a target graph, the number of the matchings found by the algorithm is compared with the ground truth, obtaining the following quantities:
    • TP = number of true positives, i.e. matchings found by the algorithm and present in the ground truth
    • FP = number of false positives, i.e. matchings found by the algorithm but not present in the ground truth
    • FN = number of false negatives, i.e. matchings present in the ground truth that the algorithm is not able to find.

    Then, two information retrieval indices, i.e. Precision and Recall, are computed over the dataset:

    • Precision = TP / (TP + FP)
    • Recall = TP / (TP + FN)
  • Graph Edit Distance: The aim of this metric is to appraise how close the found solutions are to the optimal ones. In order to compare a solution with the pattern graph, the graph edit distance normalized with respect to the dimension of the pattern is used. Thus, the accuracy for a given pattern and target graph is the average value of the normalized distances of all the proposed solutions. To summarize the obtained results, the average normalized distance is calculated over the entire dataset. In order to compute the graph edit distance, the following graph edit costs are defined:
    • Node addition/removal: 1
    • Edge addition/removal: 1
    • Node label substitution: 1.5
Time

Two kinds of time measures will be considered:

  • Total Time: time in milliseconds needed for obtaining all the solutions.
  • Time for the first matching: time required to get the first solution.
Memory

The peak memory required to find all the solutions will be measured

Some useful links are shown below:

  • RI Dataset: a set of graph datasets extracted from the same biological data sources used for the contest.
  • MIVIA Dataset: an extensive graph dataset having a wide variety of graphs.

Go back to the top