Screenplays & Play

6 pages
4 views

An information-based sequence distance and its application to whole mitochondrial genome phylogeny

of 6
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Share
Description
An information-based sequence distance and its application to whole mitochondrial genome phylogeny
Transcript
  BIOINFORMATICS   Vol. 17 no. 2 2001Pages 149–154  An information-based sequence distance and its application to whole mitochondrial genome phylogeny  Ming Li  1, ∗ , Jonathan H. Badger  1 , Xin Chen  2  , Sam Kwong  2  ,Paul Kearney  1 and Haoyong Zhang  1 1 Bioinformatics Laboratory, Computer Science Department, University of Waterloo,N2L 3G1, Canada and   2  Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong  Received on July 19, 2000; revised on October 5, 2000; accepted on October 11, 2000 ABSTRACTMotivation:  Traditional sequence distances require analignment and therefore are not directly applicable tothe problem of whole genome phylogeny where eventssuch as rearrangements make full length alignmentsimpossible. We present a sequence distance that workson unaligned sequences using the information theoreticalconcept of Kolmogorov complexity and a program toestimate this distance. Results:  We establish the mathematical foundations ofour distance and illustrate its use by constructing a phy-logeny of the Eutherian orders using complete unalignedmitochondrial genomes. This phylogeny is consistent withthe commonly accepted one for the Eutherians. A second,larger mammalian dataset is also analyzed, yieldinga phylogeny generally consistent with the commonlyaccepted one for the mammals. Availability:  The program to estimate our sequencedistance, is available at http://www.cs.cityu.edu.hk/  ∼ cssamk/gencomp/GenCompress1.htm. The distancematrices used to generate our phylogenies are availableat http://www.math.uwaterloo.ca/  ∼ mli/distance.html Contact:  mli@wh.math.uwaterloo.ca INTRODUCTION The fast advance of worldwide genome sequencingprojects has raised a fundamental and challenging ques-tion to modern biological science: how do we comparetwo genomes? In fact, it earned the first position in tworecent lists of major open problems in bioinformatics(Koonin, 1999; Wooley, 1999). Existing tools and meth- ods such as multiple alignment and various sequenceevolutionary models do not directly apply to completegenomes where such events as rearrangements make ∗ To whom correspondence should be addressed. traditional full length alignments impossible, and nostatistical models currently exist for the evolution of complete genomes.In the absence of such models, a method which cancompute the shared information between two sequences isuseful because biological sequences encode information,and the occurrence of evolutionary events (such as in-sertions, deletions, point mutations, rearrangements, andinversions) separating two sequences sharing a commonancestor will result in the loss of their shared information.Regions of sequences which do not share a commonancestor will not share more information than would beexpected at random. Here we present a mathematicallyrigorous universal distance based on shared algorithmicinformation; a fully automated and accurate softwaretool based on such distance to compare two genomes,or two English texts for that matter; and we demonstratethat whole mitochondrial genome phylogenies can bereconstructed automatically from  unaligned   completemitochondrial genomes by our software. METHODS Definition of distance We begin by defining what we mean by a distance.Without loss of generality, a distance only needs to operateon sequences of 0s and 1s since any sequence can berepresented by a binary sequence. We also only considernormalized distance functions  d   such that 0    d  (  x  ,  y )   1 for all sequences  x   and  y . For a function  d   to be a‘distance’, it must satisfy (a)  d  (  x  ,  y ) >  0 for  x   =  y ;(b)  d  (  x  ,  x  )  =  0; (c)  d  (  x  ,  y )  =  d  (  y ,  x  )  (symmetric); and(d)  d  (  x  ,  y )  d  (  x  ,  z ) + d  (  z ,  y )  (triangle inequality).Given two sequences  x   and  y , our new distance  d  (  x  ,  y ) is defined as follows d  (  x  ,  y )  =  1 − K  (  x  ) −  K  (  x  |  y ) K  (  xy ),  (1) c   Oxford University Press 2001  149  M.Li et al. where  K  (  x  |  y )  is the conditional Kolmogorov complexity(or algorithmic entropy) of   x   given  y .  K  (  x  |  y )  is definedas the length of the shortest program causing a standarduniversal computer to output  x   on input  y , and  K  (  x  )  isdefined as  K  (  x  | ǫ) , where  ǫ  is the empty string. See Liand Vit´anyi (1997) for formal definitions of Kolmogorovcomplexity and its successful applications in physics,mathematics, and computer science.  K  (  x  |  y )  measuresthe randomness of   x   given  y . The numerator  K  (  x  )  − K  (  x  |  y )  is the amount of information  y  knows about  x  . It is a deep theorem in Kolmogorov complexity that K  (  x  )  −  K  (  x  |  y )  ≈  K  (  y )  −  K  (  y |  x  )  (Li and Vit´anyi,1997). That is, the information  y  knows about  x   is equalto the information  x   knows about  y , for all  x   and  y .This is called mutual algorithmic information between  x   and  y . The denominator  K  (  xy ) , being the amount of information in the string  x   concatenated with  y , servesas a normalization factor such that the distance  d  (  x  ,  y ) ranges between 0 (when  y  ‘knows’ all about  x  ) and1 (when  y  ‘knows’ nothing about  x  ). However, mutualalgorithmic information is not itself a distance and itdoes not satisfy the triangle inequality. Clearly,  d   satisfiesdistance conditions (a), (b), (c). It is not obvious that italso satisfies (d). The following theorem answers this. All(in)equalities are modulo an additive  O ( log n )  term.T HEOREM  1.  d  (  x  ,  y )  satisfies the triangle inequality,that is, d  (  x  ,  z )  d  (  x  ,  y ) + d  (  y ,  z ) . P ROOF . We need to show:1 − K  (  x  ) −  K  (  x  |  z ) K  (  xz )  1 − K  (  x  ) −  K  (  x  |  y ) K  (  xy ) + 1 − K  (  y ) −  K  (  y |  z ) K  (  yz ). This is equivalent to, by the Symmetry of Informationtheorem (Li and Vit´anyi, 1997), K  (  z |  x  ) +  K  (  x  |  z ) K  (  xz )   K  (  y |  x  ) +  K  (  x  |  y ) K  (  xy ) + K  (  z |  y ) +  K  (  y |  z ) K  (  yz ). It is sufficient to prove the following two inequalities: K  (  x  |  z ) K  (  xz )   K  (  x  |  y ) K  (  xy ) + K  (  y |  z ) K  (  yz ) K  (  z |  x  ) K  (  xz )   K  (  y |  x  ) K  (  xy ) + K  (  z |  y ) K  (  yz ). The two inequalities are symmetric. To prove the first,let  r   =  K  (  x  |  z ) ,  q  =  K  (  x  |  y ) ,  p  =  K  (  y |  z ) . Obviously r     p + q , by elementary facts in Kolmogorov complexity(Li and Vit´anyi, 1997). Let  r   =  p + q  −  . Then, K  (  x  |  z ) K  (  xz )   r K  (  z ) + r    p + q  −  K  (  z ) +  p + q  −    p + qK  (  z ) +  p + q   qK  (  z ) +  p + q +  pK  (  z ) +  p + q   qK  (  xy ) +  pK  (  yz ) = K  (  x  |  y ) K  (  xy ) + K  (  y |  z ) K  (  yz ). This proves the first inequality. The second is provedsymmetrically.   While it is true that when the quantities of Kolmogorovcomplexity terms in the above proof are extremely small(non-interesting case), then the above proof needs to bemore carefully formulated (Vit´anyi,P. personal communi-cation). But to maintain clarity, we have chosen not to dothat, although a similar proof still works. UNIVERSALITY Now, consider any computable distance  D . In order toexclude degenerate distances such as  D (  x  ,  y )  =  1 / 2 forall sequences  x   and  y , we limit the number of sequencesin a neighborhood of size  d  . Let us require for each  x  , |{  y  : |  y | =  n  and  D (  x  ,  y )  d  }|  2 dn .  (2)Assuming equation (2), we prove the following theorem.T HEOREM  2.  For any computable distance D, there isa constant c  <  2  such that, with probability 1, for allsequences x and y, d  (  x  ,  y )  cD (  x  ,  y ) . P ROOF . By definition 1 and the Symmetry of Infor-mation theorem, as in Theorem 1, we know that, up to O ( log n )  factor, d  (  x  ,  y )  = K  (  x  |  y ) +  K  (  y |  x  ) K  (  xy ).  (3)Given  D , using the density property in formula 2 andthe computable function  D , we know that  K  (  x  |  y )    D (  x  ,  y ) n , and  K  (  y |  x  )    D (  x  ,  y ) n . With probability 1, K  (  xy ) >  n , noting  |  x  | =  n  or  |  y | =  n . Thus, equation 3gives d  (  x  ,  y )  = K  (  x  |  y ) +  K  (  y |  x  ) K  (  xy )   2  D (  x  ,  y ) nn =  2  D (  x  ,  y ), with probability 1. This proves the theorem with  c  2.  150  Information-based sequence distance and mitochondrial genome phylogeny  This demonstrates mathematically the following: if   x  and  y  are ‘close’ according to distance measure  D , thenthey are also ‘close’ according to our new distance  d  . Inother words, if   D  reveals some similarity between  x   and  y , so does  d  .R EMARK  1.  Our distance d  (  x  ,  y )  naturally takes careof some tricky problems which plague other stochasticmeasures. For example, some methods might tend tocluster all sequences with high G  +  C content together.Since d  (  x  ,  y )  only measures the shared information, thebase composition bias gets naturally self canceled in theK  (  x  ) −  K  (  x  |  y )  calculation (in our approximation, whichwill be described in the next section, this is done byarithmetic encoding). Therefore, unlike many stochasticmeasures, our distance does not require that the sequencesobeythestationaritycondition.Thismayalsobeviewedasa consequence of the universality statement. RESULTS Whole genome phylogeny With many genomes already sequenced, imminent com-pletion of the human genome and the completion of manyother sequencing projects on the horizon, whole genomeanalysis and especially pairwise genome comparison isa great challenge in genomics (Koonin, 1999; Wooley,1999). Already there have been proposals to comparegenomes using gene order (Boore and Brown, 1998)and gene content (Fitz-Gibbon and House, 1999; Snel et al. , 1999). Such comparisons are time consuming asthey require gene identification. These distances, togetherwith G + C content; edit distance; and reversal andrearrangement distances (Kececioglu and Sankoff, 1995;Hannenhalli and Pevzner, 1995; Nadeau and Sankoff,1998) compare genomes using only partial genomeinformation whereas our new distance uses all genomeinformation. The transformation distance (Varre  et al. ,1998) and compression distance (Grumbach and Tahi,1994) are essentially defined as  K  (  x  |  y )  which is badlyasymmetric, and so, is not a distance. Only when allsequences are random, which is not the case for DNAsequences do these two distances coincide with our newdistance. In fact, mathematically, all above distances canbe formulated as special cases of our new distance.Kolmogorov complexity can be thought of as theultimate lower bound of all measures of information andcannot be computed in the general case (Li and Vit´anyi,1997). Therefore, our new distance must be approximated.For this purpose, we use the program  GenCompress (Chen  et al. , 2000) which is currently the best compres-sion program for DNA sequences, achieving the bestcompression ratios for benchmark DNA sequences, toheuristically approximate  K  (  x  |  y ) ,  K  (  x  ) , and  K  (  xy ) . GenCompress  finds approximate matches (hence editdistance becomes a special case), approximate reversecomplements, among other things, with arithmetic encod-ing when necessary.  GenCompress  was implemented byChen  et al.  (2000) and can be downloaded from our web-site. It is capable of compressing in a day the complete,34 megabase human chromosome 22, (achieving 12%compression).To test our theory, we have performed several experi-ments on the construction of whole genome phylogeniesusing our new distance, all with very encouraging results.Two such experiments follow. Phylogeny of Eutherian orders It has been debated which two of the three main groupsof placental mammals are more closely related: Pri-mates, Ferungulates, and Rodents. This is because bythe maximum likelihood method, some proteins supportthe (Ferungulates, (Primates, Rodents)) grouping whileother proteins support the (Rodents, (Ferungulates, Pri-mates)) grouping (Cao  et al. , 1998). Cao  et al.  aligned 12concatenated mitochondrial proteins from the followingspecies: rat (  Rattus norvegicus ), house mouse (  Musmusculus ), grey seal (  Halichoerus grypus ), harbor seal( Phoca vitulina ), cat ( Felis catus ), white rhino ( Cera-totherium simum ), horse (  Equus caballus ), finback whale(  Balaenoptera physalus ), blue whale (  Balaenoptera mus-culus ), cow (  Bos taurus ), gibbon (  Hylobates lar  ), gorilla( Gorilla gorilla ), human (  Homo sapiens ), chimpanzee( Pan troglodytes ), pygmy chimpanzee ( Pan paniscus ),orangutan ( Pongo pygmaeus ), Sumatran orangutan( Pongo pygmaeus abelii ), using opossum (  Didelphisvirginiana ), wallaroo (  Macropus robustus ) and platypus( Ornithorhynchus anatinus ) as the outgroup, and builtthe maximum likelihood tree to confirm the grouping(Rodents, (Primates, Ferungulates)). Using the completemitochondrial genomes of these species we computedour new distance  d  (  x  ,  y )  between each pair of species  x   and  y  and constructed a tree (Figure 1) using theneighbor joining (Saitou and Nei, 1987) program inthe MOLPHY package (Adachi and Hasegawa, 1996).The tree is identical to the maximum likelihood tree of Cao  et al. Because neighbor-joining is sometimes distrusted(Hillis  et al. , 1994; Kuhner and Felsenstein, 1994), tofurther corroborate this grouping we applied our ownhypercleaning program (Bryant  et al. , 2000) to the samedistance matrix and obtained the same tree. The hyper-cleaning program constructs an evolutionary tree usingthe edges best supported by all possible four taxa sub-trees (commonly called ‘quartets’). Thus using the newinformation-theoretic distances derived from the  complete mtDNA genomes we have re-confirmed the hypothesisof (Rodents, (Primates, Ferungulates)). The distancematrix can be found at our website (see the abstract). 151  M.Li et al. PlatypusWallarooOpossumOpossumRatHouseMouseCatHarborSealGreySealWhiteRhinoHorseFinbackWhaleBlueWhaleCowGibbonGorillaHumanPygmyChimpanzeeChimpanzeeOrangutanSumatranOrangutan 100100100981001009989100100100 100100Rodents 100Ferungulates 100Primates 100Marsupials and monotremes 96 Fig. 1.  The evolutionary tree built from the complete mammalianmtDNA sequences of the taxa analyzed in Cao  et al.  (1998). Thenumbers associated with each clade are the percentage of quartetssupporting the grouping according to the hypercleaning algorithm(Bryant  et al. , 2000), and can be interpreted in a similar fashion tobootstrap values. The simple asymmetric measure  K  (  x  |  y )  leads to a wrongtree using the same data and programs, as expected (datanot shown). The gene order (Boore and Brown, 1998) andgene content (Fitz-Gibbon and House, 1999; Snel  et al. ,1999) approaches, while yielding symmetric distances,have the disadvantage of requiring laborious humananalysis of the sequences and also are unlikely to provideenough information to distinguish closely related speciessuch as the above data set. Our method is fully automaticand utilizes the information contained in noncodingregions in addition to the information contained in thegenes.To further assure our result, we have extracted thecoding regions only from mtDNAs of the above species,and performed the same computation. We have obtainedthe same tree. Phylogenetic position of the rodents We also analyzed a larger dataset derived from the 34-taxamitochondrial genome phylogeny in Reyes  et al.  (2000). 9998979997969099 Fig. 2.  The evolutionary tree built from the complete mammalianmtDNA sequences of the taxa analyzed in Reyes  et al.  (2000). Thenumbers associated with each clade are the percentage of quartetssupporting the grouping according to the hypercleaning algorithm(Bryant  et al. , 2000), and can be interpreted in a similar fashion tobootstrap values. Only values less than 100 are shown. This dataset included 19 of the 20 taxa in Cao  et al. (1998) and the additional 15 taxa: aardvark ( Orycteropusafer  ), armadillo (  Dasypus novemcintus ), baboon ( Papiohamadryas ), dog ( Canis familiaris ), donkey (  Equusasinus ), dormouse ( Glis glis ), elephant (  Loxodontaafricana ), fruit bat (  Artibeus jamaicensis ), great rhino(  Rhinoceros unicornis ), guinea pig ( Cavia porcellus ),hedgehog (  Erinaceus europaeus ), hippo (  Hippopotamusamphibius ), pig ( Sus scrofa ), rabbit ( Oryctolagus cunicu-lus ), sheep ( Ovis aries ), and squirrel ( Sciurus vulgaris ).This denser, more diverse, and more controversial datasetyielded a distance matrix, which, when analyzed byneighbor-joining and hypercleaning, resulted in twosomewhat different phylogenies. The consensus of thesetwo phylogenies is presented in Figure 2. 152  Information-based sequence distance and mitochondrial genome phylogeny  While this consensus agrees with the overall structureof the phylogeny presented in Reyes  et al.  (2000),including the possible nonmonophyleticity of the rodents,several discrepancies exist. In particular, our methodgroups the pig with the perisodactyls rather than withthe cetartiodactyls, the carnivores form a ferungulateoutgroup, and the guinea pig groups with neither themurid nor the nonmurid rodents. However, neither of theselatter two discrepancies are unreasonable hypotheses. Theoutgroup status of the carnivores has been suggested byGraur  et al.  (1997), and the phylogenetic position of theguinea pig is one of the most controversial topics insystematic biology (Graur  et al. , 1991; D’Erchia  et al. ,1996; Cao  et al. , 1997; Sullivan and Swofford, 1997;Reyes  et al. , 1998). CONCLUSIONS Our goal in this paper is not to confirm or refute previousphylogenetic studies but rather to bring a new methodol-ogy and a new tool to the comparative genomics researchcommunity. Our new method for whole genome compari-son and phylogeny does not require gene identification norany human intervention, in fact, it is totally automatic. Itis mathematically well-founded being based on general in-formation theoretic concepts. It works when there are noagreed upon evolutionary models, as further demonstratedby the successful construction of a chain letter phylogeny(manuscript in preparation by Bennett,C.H., Li,M., andMa,B.), and when individual gene trees do not agree (e.g.Cao  et al. , 1998) as is the case for genomes. Another pos-sible use of our method is as an evaluator of alignments,as alignments with little shared information are unlikely toyield meaningful phylogenies by any method. Althougha possible criticism of our method is that it is based oninformation theory rather than a biological model, it isworth stressing that the alignment algorithms that biolo-gists use today are in fact also based on information the-ory, although less rigorously.The method depends on the ability to efficiently andsharply approximate shared information, as this informa-tion is not computable. Our preliminary experiments haveshown that our estimation approach is fruitful. However,to improve our results we need a better conditional en-tropy estimator of DNA sequences, as distances betweenhighly divergent sequences tend to be similar to each otherby our current estimator, making deep branches in phy-logenies difficult to resolve. Still, our experiments havedemonstrated clearly that the method is useful even withour current estimator. With a sharper conditional entropyestimator, we believe that this method will accurately re-cover whole organismal genome phylogenies as well. ACKNOWLEDGEMENTS We thank Ford Doolittle, Brian Golding, MasamiHasegawa, and Huaichun Wang for providing very usefulinformation, and Tao Jiang, Pavel Pevzner, David Sankoff and Huaichum Wang. We especially thank Paul Vit´anyi,who pointed out several loopholes in an earlier versionof this paper, and suggested the new formulation of Theorem 2. A referee suggested Reyes  et al.  (2000) datato us. J.H.B. was supported by a CITO grant. X.C., S.K.,and M.L. were supported in part by a CityU researchgrant 7000875, P.K. was supported by NSERC ResearchGrant 160321 and a CITO grant, M.L. was also supportedin part by NSERC Research Grant OGP0046506, a CITOgrant, and an NSERC Steacie Fellowship. H.Z. wassupported by NSERC Research Grants OGP0046506 and160321. REFERENCES Adachi,J. and Hasegawa,M. (1996) MOLPHY version 2.3: pro-grams for molecular phylogenetics based on maximum likeli-hood.  Comput. Sci. Monogr. Inst. Stat. Math. ,  28 , 1–150.Boore,J.L. and Brown,W.M. (1998) Big trees from little genomes:mitochondrial gene order as a phylogenetic tool.  Curr. Opin.Genet. Dev. ,  8 , 668–674.Bryant,D., Berry,V., Kearney,P., Li,M., Jiang,T., Wareham,T. andZhang,H. (2000) A practical algorithm for recovering the bestsupported edges of an evolutionary tree. In  Proceedingsof the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms .Cao,Y., Janke,A., Waddell,P.J., Westerman,M., Takenaka,O., Mu-rata,S., Okada,N., P¨a¨abo,S. and Hasegawa,M. (1998) Conflictamong individual mitochondrial proteins in resolving the phy-logeny of eutherian orders.  J. Mol. Evol. ,  47 , 307–322.Cao,Y., Okada,N. and Hasegawa,M. (1997) Phylogenetic positionof guinea pigs revisited.  Mol. Biol. Evol. ,  14 , 461–464.Chen,X., Kwong,S. and Li,M. (2000) A compression algorithm forDNA sequences based on approximate matching. In Shamir,R.,Miyano,S., Istrail,S., Pevzner,P. and Waterman,M. (eds),  Pro-ceedings of the Fourth Annual International Conference on Com- putational Molecular Biology (RECOMB) . Association for Com-puting Machinery, Tokyo, Japan, pp. 107.D’Erchia,A.M., Gissi,C., Pesole,G., Saccone,C. and Arnason,U.(1996) The guinea pig is not a rodent.  Nature ,  381 , 597–599.Fitz-Gibbon,S.T. and House,C.H. (1999) Whole genome-basedphylogenetic analysis of free-living microorganisms.  Nucleic Acids Res. ,  27 , 4218–4222.Graur,D., Gouy,M. and Duret,L. (1997) Evolutionary affinities of the order Perissodactyla and the phylogenetic status of thesuperordinal taxa Ungulata and Altungulata.  Mol. Phylogenet. Evol. ,  7 , 195–200.Graur,D.,Hide,W.A.andLi,W.-H.(1991)Istheguineapigarodent?  Nature ,  351 , 649–652.Grumbach,S. and Tahi,F. (1994) A new challenge for compressionalgorithms: genetic sequences.  J. Info. Process. Manage. ,  30 ,875–866. 153
Related Documents
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks
SAVE OUR EARTH

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!

x