Health & Medicine

12 pages
7 views

Global alignment of multiple protein interaction networks with application to functional orthology detection

Please download to get full document.

View again

of 12
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Share
Description
Global alignment of multiple protein interaction networks with application to functional orthology detection
Transcript
  GLOBAL ALIGNMENT OF MULTIPLE PROTEININTERACTION NETWORKS ROHIT SINGH a JINBO XU b BONNIE BERGER a ∗ a Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology  b Toyota Technological Institute, ChicagoE-mail:  { rsingh@mit.edu, j3xu@tti-c.org, bab@mit.edu  } We describe an algorithm for global alignment of multiple protein-protein interac-tion (PPI) networks, the goal being to maximize the overall match across the inputnetworks. The intuition behind our algorithm is that a protein in one PPI networkis a good match for a protein in another network if the former’s neighbors aregood matches for the latter’s neighbors. We encode this intuition by constructingan eigenvalue problem for every pair of input networks and then using k-partitematching to extract the final global alignment across all the species. We com-pute the first known global alignment of PPI networks from five species: yeast,fly, worm, mouse and human. The global alignment immediately suggests func-tional orthologs across these species; we believe these are the first set of functionalorthologs that cover all the five species. We show that these functional orthologscompare favorably with current sequence-only orthology prediction approaches, in-cluding better prediction of orthologs for some human disease-related proteins. Supplementary Information:  http://groups.csail.mit.edu/cb/mna 1. Introduction Over the past few years, the use of high-throughput experimentaltechniques 15 , 12 for discovering protein-protein interactions (PPIs) has led toa tremendous increase in the corpus of available PPI data in various species.A useful representation of this data is as a network: each node in such anetwork corresponds to a protein and an edge between two nodes indicatesthat the corresponding proteins interact. Analysis of such PPI networks hasyielded some deep biological insights 9 . In this paper, we explore methodsfor comparing PPI networks across species. Such comparative analysis hasproven to be a valuable tool. It has led, for example, to the identificationof conserved functional components across various species, complementingtraditional sequence-only phylogenetic analysis. It also helps in identifyingerrors in experimental PPI data and in transferring annotation across species. ∗ Corresponding author. Also in the MIT Dept. of Mathematics. Pacific Symposium on Biocomputing 13:303-314(2008)  We also explore the use of such comparative analysis in improving orthol-ogy predictions across species. Identifying cross-species gene correspondences(orthologs) is a problem of fundamental biological importance— it is crucialfor transferring insights and information across species. Contributions One of the main contributions of this paper is the first algorithm for globalalignment of multiple protein interaction networks. We perform a globalalignment of PPI networks from five species: yeast, fly, worm, mouse, andhuman. We pursue the following intuition: a node in a PPI network is agood match for a node in another network if its neighbors are good matchesfor the neighbors of the other node. To formalize the intuition, we constructa set of eigenvalue problems in an approach similar to Google’s PageRank 18 algorithm and then use k-partite matching to compute the final alignment.The multiple network alignment directly leads to the first comprehensiveestimates of functional orthologs that incorporate both sequence and PPIdata and cover all the five species mentioned previously. These estimatesare more comprehensive than the two most commonly used orthology sets:Homologene 5 and Inparanoid 16 . Our list covers more genes than Homolo-gene. Unlike Inparanoid, which considers pairs of species at a time, ourmethod analyzes data from all input species simultaneously.We also introduce a novel approach,  functional coherence  , for evaluatingorthology predictions. Currently such predictions are evaluated by manuallyanalyzing selected sets of orthologs. In contrast, our automated approachmeasures the functional similarity within each set of orthologous proteinsand computes an aggregate score. Using it, we demonstrate that our algo-rithm makes predictions with slightly better overall quality than Homologeneand Inparanoid. Also, further analysis indicates that some of the improvedpredictions from our method include disease-related proteins. Related Work PPI Network Alignment:  The protein network alignment problem can beformulated either as a global or a local network alignment problem. Muchof the previous work 3 , 11 , 9 in the field has focused on the problem of localnetwork alignment (see Sec. 2). In contrast, we focus on the global alignmentproblem. Recently, we have proposed the first algorithm for  pairwise   globalalignment of PPI networks. The multiple network alignment algorithm wepresent in this paper is, we believe, the first algorithm for global alignment of multiple protein networks. While this paper builds upon some of the methodspresented in our previous work, there are also many significant differencesbetween the two problems and the corresponding algorithms (see Sec. 2). Pacific Symposium on Biocomputing 13:303-314(2008)  Functional Ortholog Prediction:  Currently, orthology prediction is usuallydone by using sequence-similarity information between various genes to esti-mate sets of genes that have descended from a common ancestor. A key chal-lenge here is to distinguish between orthologs and paralogs, the latter beinggenes that are created by duplication after the two species have diverged.We briefly describe here two commonly used orthology prediction meth-ods: Inparanoid and Homologene (see Chen  et al. 6 for more). Inparanoid 16 computes orthologs between pairs of species by making explicit assumptionsabout the relative sequence similarity scores between orthologs and paralogs.One of its drawbacks is that it is limited to pairwise orthology estimates, i.e.,it cannot analyze data from multiple species simultaneously. Homologene 5 is an approach that can simultaneously compute orthologs across multiplespecies by using sequence-similarity scores to construct a tree of proteinsand, based upon certain heuristics, grouping them into clusters of ortholo-gous genes.Recently, efforts have been made to integrate PPI data into the orthol-ogy prediction process, to identify sets of proteins that perform the samefunction. Bandyopadhyay  et al.  2 have described the use of local networkalignment results in identifying  functional orthologs   between yeast and fly.In previous work, we have described a two-way global alignment algorithmwhich directly suggests functional orthologs between yeast and fly; these pre-dictions compare favorably with Bandyopadhyay  et al. ’s. This paper is thefirst, we believe, to present functional orthologs across  multiple   species. Byintegrating data from multiple species simultaneously, we should be able toimprove upon predictions made from pairs of species. 2. Problem Formulation The input to our algorithm consists of two or more protein interaction net-works (one per species). Each input network can be represented as an undi-rected graph  G  = ( V,E  ) where  V   is the set of nodes and  E   is the set of edges.Furthermore, a confidence measure  w ( e ) (0  < w ( e )  ≤  1) may be associatedwith each edge  e  in  E  . Additionally, the input may also consist of pairwisenode similarity measures between nodes from the different networks. In thispaper, these similarity measures are BLAST Bit-values, but other scores(e.g., synteny-based scores 10 ) can also be used. Given these inputs, our goalis to find the best overall match (i.e, optimal global alignment) between theinput networks. This will directly lead to a list of functional orthologs. Local vs. Global Network Alignment:  Network alignment problemsvary in the scope of the input (two vs. multiple networks), and the kind Pacific Symposium on Biocomputing 13:303-314(2008)  XXXG 1  G 2 G 1  G 2Local Network Alignment Global Network Alignment Figure 1:  Cartoon comparing global and local network alignments : The localnetwork alignment between  G 1  and  G 2  specifies three different alignments; the mappingsfor each are marked by a different kind of line (solid, dashed, dotted). Each alignmentdescribes a small common subgraph. Local alignments need not be consistent in theirmapping– the points marked with ‘X’ each have ambiguous/inconsistent mappings underdifferent alignments. In global network alignment, the maximum common subgraph isdesired. In both cases, there are ‘gap’ nodes for which no mappings could be predicted(here, the nodes with no incident black edges are such nodes). of node-mapping desired. In general, the goal in all such problems is toidentify one or more mappings between the nodes of the input networks and,for each mapping, the corresponding set of conserved edges. A mapping maybe partial, i.e., it need not be defined for all the nodes in the networks. Eachmapping implies a common subgraph between the input networks: whenprotein  a 1  from network  G 1  is mapped to proteins  a 2  from  G 2  and  a 3  from G 3 , then  a 1 ,  a 2 , and  a 3  refer to the same node in the common subgraph;the edges in the common subgraph correspond to the conserved edges. Akey difference between our approach and many previous network alignmentapproaches is in the kind of mapping desired.Much of the previous work 3 , 11 , 9 has focused on local network alignment(LNA), i.e., on finding local regions of isomorphism (i.e., same graph struc-ture) between the input networks. Each such region implies a mapping in-dependently of others. Many independent, high-scoring local alignments areusually possible between two input networks; in fact, the corresponding lo-cal alignments need not even be mutually consistent (i.e., a protein might bemapped differently under each alignment; see Fig. 1).In contrast, we focus on the global network alignment (GNA) problem.The aim in GNA is to find the best overall alignment between the inputnetworks. The mapping in a GNA should cover all the input nodes: eachnode in an input network is either matched to one or more nodes in othernetwork(s) or explicitly marked as a gap node (i.e., with no match in anothernetwork). In contrast, a LNA algorithm outputs multiple, independent map- Pacific Symposium on Biocomputing 13:303-314(2008)  pings, each corresponding to a local region of similarity. Furthermore, thesepartial mappings may be mutually inconsistent. The mapping correspondingto a GNA is also required to be transitive: if   a 1  in  G 1  is mapped to  a 2  in  G 2 and  a 2  is mapped to nodes  a 3 ,a ′ 3  in  G 3 , then  a 1  should also be mapped to a 3 ,a ′ 3 . Our goal in GNA then is to find a comprehensive mapping betweenthe nodes of the input networks such that the size of the single correspond-ing common subgraph is maximized. Our previous work 17 contains a moredetailed comparison of the LNA and GNA problems.A key difference between the multiple-network GNA (the focus of thispaper) and pairwise GNA (the focus of our previous work 17 ) is in the scopeof the mapping desired. In the latter, we required that a node may bemapped to at most one node in the other network, the motivation being tofind the  best   match for a node. In contrast, for the multiple networks casewe allow for a node to map to multiple nodes in another network. Thisis necessary because gene duplication, mutation, and deletion events mightmake it impossible to find a valid one-to-one, transitive mapping betweenproteins across an arbitrary collection of species. 3. Algorithm To describe a global alignment between input networks, we need to specify anode mapping between the input networks and the corresponding commonsubgraph. We focus on the computing the node mapping, since the subgraphcan be easily computed once the former is known.Our algorithm works in two stages. First, given  k  input networks, wecreate a k-partite graph H . Each of its  k  parts contains nodes from one of theinput networks. Edges are only allowed between nodes from different parts.The presence of an edge  e ij  implies that node  i  (from  G 1 ) can potentiallybe mapped to  j  (from  G 2 ); the edge-weight  R ij  indicates the strength of thepotential match. In the second stage, we perform k-partite matching on  H to group nodes into clusters. All nodes in a cluster are then mapped to eachother in the corresponding GNA. First Stage (Creating the k-partite graph):  We start with the  k  inputPPI networks and sequence similarity scores between the nodes. For everypair of input networks, we compute a score for every possible pairing betweenthe nodes of the two networks. Let  R ij  ( R ij  ≥ 0) be the score for the proteinpair ( i,j ) where  i  is from network  G 1  and  j  is from network  G 2 . Intuitively, R ij  should capture how good a match  i  and  j  are: higher  R ij  implies a bettermatch. In the second stage, we will use these scores to guide our algorithmtowards the optimal k-partite matching of   H . Pacific Symposium on Biocomputing 13:303-314(2008)
Related Documents
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks
SAVE OUR EARTH

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!

x