BIOINFORMATICS
Vol. 17 no. 2 2001Pages 149–154
An informationbased sequence distance and its application to whole mitochondrial genome phylogeny
Ming Li
1,
∗
, Jonathan H. Badger
1
, Xin Chen
2
, Sam Kwong
2
,Paul Kearney
1
and Haoyong Zhang
1
1
Bioinformatics Laboratory, Computer Science Department, University of Waterloo,N2L 3G1, Canada and
2
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
Received on July 19, 2000; revised on October 5, 2000; accepted on October 11, 2000
ABSTRACTMotivation:
Traditional sequence distances require analignment and therefore are not directly applicable tothe problem of whole genome phylogeny where eventssuch as rearrangements make full length alignmentsimpossible. We present a sequence distance that workson unaligned sequences using the information theoreticalconcept of Kolmogorov complexity and a program toestimate this distance.
Results:
We establish the mathematical foundations ofour distance and illustrate its use by constructing a phylogeny of the Eutherian orders using complete unalignedmitochondrial genomes. This phylogeny is consistent withthe commonly accepted one for the Eutherians. A second,larger mammalian dataset is also analyzed, yieldinga phylogeny generally consistent with the commonlyaccepted one for the mammals.
Availability:
The program to estimate our sequencedistance, is available at http://www.cs.cityu.edu.hk/
∼
cssamk/gencomp/GenCompress1.htm. The distancematrices used to generate our phylogenies are availableat http://www.math.uwaterloo.ca/
∼
mli/distance.html
Contact:
mli@wh.math.uwaterloo.ca
INTRODUCTION
The fast advance of worldwide genome sequencingprojects has raised a fundamental and challenging question to modern biological science: how do we comparetwo genomes? In fact, it earned the ﬁrst position in tworecent lists of major open problems in bioinformatics(Koonin, 1999; Wooley, 1999). Existing tools and meth
ods such as multiple alignment and various sequenceevolutionary models do not directly apply to completegenomes where such events as rearrangements make
∗
To whom correspondence should be addressed.
traditional full length alignments impossible, and nostatistical models currently exist for the evolution of complete genomes.In the absence of such models, a method which cancompute the shared information between two sequences isuseful because biological sequences encode information,and the occurrence of evolutionary events (such as insertions, deletions, point mutations, rearrangements, andinversions) separating two sequences sharing a commonancestor will result in the loss of their shared information.Regions of sequences which do not share a commonancestor will not share more information than would beexpected at random. Here we present a mathematicallyrigorous universal distance based on shared algorithmicinformation; a fully automated and accurate softwaretool based on such distance to compare two genomes,or two English texts for that matter; and we demonstratethat whole mitochondrial genome phylogenies can bereconstructed automatically from
unaligned
completemitochondrial genomes by our software.
METHODS
Deﬁnition of distance
We begin by deﬁning what we mean by a distance.Without loss of generality, a distance only needs to operateon sequences of 0s and 1s since any sequence can berepresented by a binary sequence. We also only considernormalized distance functions
d
such that 0
d
(
x
,
y
)
1 for all sequences
x
and
y
. For a function
d
to be a‘distance’, it must satisfy (a)
d
(
x
,
y
) >
0 for
x
=
y
;(b)
d
(
x
,
x
)
=
0; (c)
d
(
x
,
y
)
=
d
(
y
,
x
)
(symmetric); and(d)
d
(
x
,
y
)
d
(
x
,
z
)
+
d
(
z
,
y
)
(triangle inequality).Given two sequences
x
and
y
, our new distance
d
(
x
,
y
)
is deﬁned as follows
d
(
x
,
y
)
=
1
−
K
(
x
)
−
K
(
x

y
)
K
(
xy
),
(1)
c
Oxford University Press 2001
149
M.Li et al.
where
K
(
x

y
)
is the conditional Kolmogorov complexity(or algorithmic entropy) of
x
given
y
.
K
(
x

y
)
is deﬁnedas the length of the shortest program causing a standarduniversal computer to output
x
on input
y
, and
K
(
x
)
isdeﬁned as
K
(
x

ǫ)
, where
ǫ
is the empty string. See Liand Vit´anyi (1997) for formal deﬁnitions of Kolmogorovcomplexity and its successful applications in physics,mathematics, and computer science.
K
(
x

y
)
measuresthe randomness of
x
given
y
. The numerator
K
(
x
)
−
K
(
x

y
)
is the amount of information
y
knows about
x
. It is a deep theorem in Kolmogorov complexity that
K
(
x
)
−
K
(
x

y
)
≈
K
(
y
)
−
K
(
y

x
)
(Li and Vit´anyi,1997). That is, the information
y
knows about
x
is equalto the information
x
knows about
y
, for all
x
and
y
.This is called mutual algorithmic information between
x
and
y
. The denominator
K
(
xy
)
, being the amount of information in the string
x
concatenated with
y
, servesas a normalization factor such that the distance
d
(
x
,
y
)
ranges between 0 (when
y
‘knows’ all about
x
) and1 (when
y
‘knows’ nothing about
x
). However, mutualalgorithmic information is not itself a distance and itdoes not satisfy the triangle inequality. Clearly,
d
satisﬁesdistance conditions (a), (b), (c). It is not obvious that italso satisﬁes (d). The following theorem answers this. All(in)equalities are modulo an additive
O
(
log
n
)
term.T
HEOREM
1.
d
(
x
,
y
)
satisﬁes the triangle inequality,that is, d
(
x
,
z
)
d
(
x
,
y
)
+
d
(
y
,
z
)
.
P
ROOF
. We need to show:1
−
K
(
x
)
−
K
(
x

z
)
K
(
xz
)
1
−
K
(
x
)
−
K
(
x

y
)
K
(
xy
)
+
1
−
K
(
y
)
−
K
(
y

z
)
K
(
yz
).
This is equivalent to, by the Symmetry of Informationtheorem (Li and Vit´anyi, 1997),
K
(
z

x
)
+
K
(
x

z
)
K
(
xz
)
K
(
y

x
)
+
K
(
x

y
)
K
(
xy
)
+
K
(
z

y
)
+
K
(
y

z
)
K
(
yz
).
It is sufﬁcient to prove the following two inequalities:
K
(
x

z
)
K
(
xz
)
K
(
x

y
)
K
(
xy
)
+
K
(
y

z
)
K
(
yz
)
K
(
z

x
)
K
(
xz
)
K
(
y

x
)
K
(
xy
)
+
K
(
z

y
)
K
(
yz
).
The two inequalities are symmetric. To prove the ﬁrst,let
r
=
K
(
x

z
)
,
q
=
K
(
x

y
)
,
p
=
K
(
y

z
)
. Obviously
r
p
+
q
, by elementary facts in Kolmogorov complexity(Li and Vit´anyi, 1997). Let
r
=
p
+
q
−
. Then,
K
(
x

z
)
K
(
xz
)
r K
(
z
)
+
r
p
+
q
−
K
(
z
)
+
p
+
q
−
p
+
qK
(
z
)
+
p
+
q
qK
(
z
)
+
p
+
q
+
pK
(
z
)
+
p
+
q
qK
(
xy
)
+
pK
(
yz
)
=
K
(
x

y
)
K
(
xy
)
+
K
(
y

z
)
K
(
yz
).
This proves the ﬁrst inequality. The second is provedsymmetrically.
While it is true that when the quantities of Kolmogorovcomplexity terms in the above proof are extremely small(noninteresting case), then the above proof needs to bemore carefully formulated (Vit´anyi,P. personal communication). But to maintain clarity, we have chosen not to dothat, although a similar proof still works.
UNIVERSALITY
Now, consider any computable distance
D
. In order toexclude degenerate distances such as
D
(
x
,
y
)
=
1
/
2 forall sequences
x
and
y
, we limit the number of sequencesin a neighborhood of size
d
. Let us require for each
x
,
{
y
: 
y
 =
n
and
D
(
x
,
y
)
d
}
2
dn
.
(2)Assuming equation (2), we prove the following theorem.T
HEOREM
2.
For any computable distance D, there isa constant c
<
2
such that, with probability 1, for allsequences x and y, d
(
x
,
y
)
cD
(
x
,
y
)
.
P
ROOF
. By deﬁnition 1 and the Symmetry of Information theorem, as in Theorem 1, we know that, up to
O
(
log
n
)
factor,
d
(
x
,
y
)
=
K
(
x

y
)
+
K
(
y

x
)
K
(
xy
).
(3)Given
D
, using the density property in formula 2 andthe computable function
D
, we know that
K
(
x

y
)
D
(
x
,
y
)
n
, and
K
(
y

x
)
D
(
x
,
y
)
n
. With probability 1,
K
(
xy
) >
n
, noting

x
 =
n
or

y
 =
n
. Thus, equation 3gives
d
(
x
,
y
)
=
K
(
x

y
)
+
K
(
y

x
)
K
(
xy
)
2
D
(
x
,
y
)
nn
=
2
D
(
x
,
y
),
with probability 1. This proves the theorem with
c
2.
150
Informationbased sequence distance and mitochondrial genome phylogeny
This demonstrates mathematically the following: if
x
and
y
are ‘close’ according to distance measure
D
, thenthey are also ‘close’ according to our new distance
d
. Inother words, if
D
reveals some similarity between
x
and
y
, so does
d
.R
EMARK
1.
Our distance d
(
x
,
y
)
naturally takes careof some tricky problems which plague other stochasticmeasures. For example, some methods might tend tocluster all sequences with high G
+
C content together.Since d
(
x
,
y
)
only measures the shared information, thebase composition bias gets naturally self canceled in theK
(
x
)
−
K
(
x

y
)
calculation (in our approximation, whichwill be described in the next section, this is done byarithmetic encoding). Therefore, unlike many stochasticmeasures, our distance does not require that the sequencesobeythestationaritycondition.Thismayalsobeviewedasa consequence of the universality statement.
RESULTS
Whole genome phylogeny
With many genomes already sequenced, imminent completion of the human genome and the completion of manyother sequencing projects on the horizon, whole genomeanalysis and especially pairwise genome comparison isa great challenge in genomics (Koonin, 1999; Wooley,1999). Already there have been proposals to comparegenomes using gene order (Boore and Brown, 1998)and gene content (FitzGibbon and House, 1999; Snel
et al.
, 1999). Such comparisons are time consuming asthey require gene identiﬁcation. These distances, togetherwith G
+
C content; edit distance; and reversal andrearrangement distances (Kececioglu and Sankoff, 1995;Hannenhalli and Pevzner, 1995; Nadeau and Sankoff,1998) compare genomes using only partial genomeinformation whereas our new distance uses all genomeinformation. The transformation distance (Varre
et al.
,1998) and compression distance (Grumbach and Tahi,1994) are essentially deﬁned as
K
(
x

y
)
which is badlyasymmetric, and so, is not a distance. Only when allsequences are random, which is not the case for DNAsequences do these two distances coincide with our newdistance. In fact, mathematically, all above distances canbe formulated as special cases of our new distance.Kolmogorov complexity can be thought of as theultimate lower bound of all measures of information andcannot be computed in the general case (Li and Vit´anyi,1997). Therefore, our new distance must be approximated.For this purpose, we use the program
GenCompress
(Chen
et al.
, 2000) which is currently the best compression program for DNA sequences, achieving the bestcompression ratios for benchmark DNA sequences, toheuristically approximate
K
(
x

y
)
,
K
(
x
)
, and
K
(
xy
)
.
GenCompress
ﬁnds approximate matches (hence editdistance becomes a special case), approximate reversecomplements, among other things, with arithmetic encoding when necessary.
GenCompress
was implemented byChen
et al.
(2000) and can be downloaded from our website. It is capable of compressing in a day the complete,34 megabase human chromosome 22, (achieving 12%compression).To test our theory, we have performed several experiments on the construction of whole genome phylogeniesusing our new distance, all with very encouraging results.Two such experiments follow.
Phylogeny of Eutherian orders
It has been debated which two of the three main groupsof placental mammals are more closely related: Primates, Ferungulates, and Rodents. This is because bythe maximum likelihood method, some proteins supportthe (Ferungulates, (Primates, Rodents)) grouping whileother proteins support the (Rodents, (Ferungulates, Primates)) grouping (Cao
et al.
, 1998). Cao
et al.
aligned 12concatenated mitochondrial proteins from the followingspecies: rat (
Rattus norvegicus
), house mouse (
Musmusculus
), grey seal (
Halichoerus grypus
), harbor seal(
Phoca vitulina
), cat (
Felis catus
), white rhino (
Ceratotherium simum
), horse (
Equus caballus
), ﬁnback whale(
Balaenoptera physalus
), blue whale (
Balaenoptera musculus
), cow (
Bos taurus
), gibbon (
Hylobates lar
), gorilla(
Gorilla gorilla
), human (
Homo sapiens
), chimpanzee(
Pan troglodytes
), pygmy chimpanzee (
Pan paniscus
),orangutan (
Pongo pygmaeus
), Sumatran orangutan(
Pongo pygmaeus abelii
), using opossum (
Didelphisvirginiana
), wallaroo (
Macropus robustus
) and platypus(
Ornithorhynchus anatinus
) as the outgroup, and builtthe maximum likelihood tree to conﬁrm the grouping(Rodents, (Primates, Ferungulates)). Using the completemitochondrial genomes of these species we computedour new distance
d
(
x
,
y
)
between each pair of species
x
and
y
and constructed a tree (Figure 1) using theneighbor joining (Saitou and Nei, 1987) program inthe MOLPHY package (Adachi and Hasegawa, 1996).The tree is identical to the maximum likelihood tree of Cao
et al.
Because neighborjoining is sometimes distrusted(Hillis
et al.
, 1994; Kuhner and Felsenstein, 1994), tofurther corroborate this grouping we applied our ownhypercleaning program (Bryant
et al.
, 2000) to the samedistance matrix and obtained the same tree. The hypercleaning program constructs an evolutionary tree usingthe edges best supported by all possible four taxa subtrees (commonly called ‘quartets’). Thus using the newinformationtheoretic distances derived from the
complete
mtDNA genomes we have reconﬁrmed the hypothesisof (Rodents, (Primates, Ferungulates)). The distancematrix can be found at our website (see the abstract).
151
M.Li et al.
PlatypusWallarooOpossumOpossumRatHouseMouseCatHarborSealGreySealWhiteRhinoHorseFinbackWhaleBlueWhaleCowGibbonGorillaHumanPygmyChimpanzeeChimpanzeeOrangutanSumatranOrangutan
100100100981001009989100100100 100100Rodents 100Ferungulates 100Primates 100Marsupials and monotremes 96
Fig. 1.
The evolutionary tree built from the complete mammalianmtDNA sequences of the taxa analyzed in Cao
et al.
(1998). Thenumbers associated with each clade are the percentage of quartetssupporting the grouping according to the hypercleaning algorithm(Bryant
et al.
, 2000), and can be interpreted in a similar fashion tobootstrap values.
The simple asymmetric measure
K
(
x

y
)
leads to a wrongtree using the same data and programs, as expected (datanot shown). The gene order (Boore and Brown, 1998) andgene content (FitzGibbon and House, 1999; Snel
et al.
,1999) approaches, while yielding symmetric distances,have the disadvantage of requiring laborious humananalysis of the sequences and also are unlikely to provideenough information to distinguish closely related speciessuch as the above data set. Our method is fully automaticand utilizes the information contained in noncodingregions in addition to the information contained in thegenes.To further assure our result, we have extracted thecoding regions only from mtDNAs of the above species,and performed the same computation. We have obtainedthe same tree.
Phylogenetic position of the rodents
We also analyzed a larger dataset derived from the 34taxamitochondrial genome phylogeny in Reyes
et al.
(2000).
9998979997969099
Fig. 2.
The evolutionary tree built from the complete mammalianmtDNA sequences of the taxa analyzed in Reyes
et al.
(2000). Thenumbers associated with each clade are the percentage of quartetssupporting the grouping according to the hypercleaning algorithm(Bryant
et al.
, 2000), and can be interpreted in a similar fashion tobootstrap values. Only values less than 100 are shown.
This dataset included 19 of the 20 taxa in Cao
et al.
(1998) and the additional 15 taxa: aardvark (
Orycteropusafer
), armadillo (
Dasypus novemcintus
), baboon (
Papiohamadryas
), dog (
Canis familiaris
), donkey (
Equusasinus
), dormouse (
Glis glis
), elephant (
Loxodontaafricana
), fruit bat (
Artibeus jamaicensis
), great rhino(
Rhinoceros unicornis
), guinea pig (
Cavia porcellus
),hedgehog (
Erinaceus europaeus
), hippo (
Hippopotamusamphibius
), pig (
Sus scrofa
), rabbit (
Oryctolagus cuniculus
), sheep (
Ovis aries
), and squirrel (
Sciurus vulgaris
).This denser, more diverse, and more controversial datasetyielded a distance matrix, which, when analyzed byneighborjoining and hypercleaning, resulted in twosomewhat different phylogenies. The consensus of thesetwo phylogenies is presented in Figure 2.
152
Informationbased sequence distance and mitochondrial genome phylogeny
While this consensus agrees with the overall structureof the phylogeny presented in Reyes
et al.
(2000),including the possible nonmonophyleticity of the rodents,several discrepancies exist. In particular, our methodgroups the pig with the perisodactyls rather than withthe cetartiodactyls, the carnivores form a ferungulateoutgroup, and the guinea pig groups with neither themurid nor the nonmurid rodents. However, neither of theselatter two discrepancies are unreasonable hypotheses. Theoutgroup status of the carnivores has been suggested byGraur
et al.
(1997), and the phylogenetic position of theguinea pig is one of the most controversial topics insystematic biology (Graur
et al.
, 1991; D’Erchia
et al.
,1996; Cao
et al.
, 1997; Sullivan and Swofford, 1997;Reyes
et al.
, 1998).
CONCLUSIONS
Our goal in this paper is not to conﬁrm or refute previousphylogenetic studies but rather to bring a new methodology and a new tool to the comparative genomics researchcommunity. Our new method for whole genome comparison and phylogeny does not require gene identiﬁcation norany human intervention, in fact, it is totally automatic. Itis mathematically wellfounded being based on general information theoretic concepts. It works when there are noagreed upon evolutionary models, as further demonstratedby the successful construction of a chain letter phylogeny(manuscript in preparation by Bennett,C.H., Li,M., andMa,B.), and when individual gene trees do not agree (e.g.Cao
et al.
, 1998) as is the case for genomes. Another possible use of our method is as an evaluator of alignments,as alignments with little shared information are unlikely toyield meaningful phylogenies by any method. Althougha possible criticism of our method is that it is based oninformation theory rather than a biological model, it isworth stressing that the alignment algorithms that biologists use today are in fact also based on information theory, although less rigorously.The method depends on the ability to efﬁciently andsharply approximate shared information, as this information is not computable. Our preliminary experiments haveshown that our estimation approach is fruitful. However,to improve our results we need a better conditional entropy estimator of DNA sequences, as distances betweenhighly divergent sequences tend to be similar to each otherby our current estimator, making deep branches in phylogenies difﬁcult to resolve. Still, our experiments havedemonstrated clearly that the method is useful even withour current estimator. With a sharper conditional entropyestimator, we believe that this method will accurately recover whole organismal genome phylogenies as well.
ACKNOWLEDGEMENTS
We thank Ford Doolittle, Brian Golding, MasamiHasegawa, and Huaichun Wang for providing very usefulinformation, and Tao Jiang, Pavel Pevzner, David Sankoff and Huaichum Wang. We especially thank Paul Vit´anyi,who pointed out several loopholes in an earlier versionof this paper, and suggested the new formulation of Theorem 2. A referee suggested Reyes
et al.
(2000) datato us. J.H.B. was supported by a CITO grant. X.C., S.K.,and M.L. were supported in part by a CityU researchgrant 7000875, P.K. was supported by NSERC ResearchGrant 160321 and a CITO grant, M.L. was also supportedin part by NSERC Research Grant OGP0046506, a CITOgrant, and an NSERC Steacie Fellowship. H.Z. wassupported by NSERC Research Grants OGP0046506 and160321.
REFERENCES
Adachi,J. and Hasegawa,M. (1996) MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood.
Comput. Sci. Monogr. Inst. Stat. Math.
,
28
, 1–150.Boore,J.L. and Brown,W.M. (1998) Big trees from little genomes:mitochondrial gene order as a phylogenetic tool.
Curr. Opin.Genet. Dev.
,
8
, 668–674.Bryant,D., Berry,V., Kearney,P., Li,M., Jiang,T., Wareham,T. andZhang,H. (2000) A practical algorithm for recovering the bestsupported edges of an evolutionary tree. In
Proceedingsof the Eleventh Annual ACMSIAM Symposium on Discrete Algorithms
.Cao,Y., Janke,A., Waddell,P.J., Westerman,M., Takenaka,O., Murata,S., Okada,N., P¨a¨abo,S. and Hasegawa,M. (1998) Conﬂictamong individual mitochondrial proteins in resolving the phylogeny of eutherian orders.
J. Mol. Evol.
,
47
, 307–322.Cao,Y., Okada,N. and Hasegawa,M. (1997) Phylogenetic positionof guinea pigs revisited.
Mol. Biol. Evol.
,
14
, 461–464.Chen,X., Kwong,S. and Li,M. (2000) A compression algorithm forDNA sequences based on approximate matching. In Shamir,R.,Miyano,S., Istrail,S., Pevzner,P. and Waterman,M. (eds),
Proceedings of the Fourth Annual International Conference on Com putational Molecular Biology (RECOMB)
. Association for Computing Machinery, Tokyo, Japan, pp. 107.D’Erchia,A.M., Gissi,C., Pesole,G., Saccone,C. and Arnason,U.(1996) The guinea pig is not a rodent.
Nature
,
381
, 597–599.FitzGibbon,S.T. and House,C.H. (1999) Whole genomebasedphylogenetic analysis of freeliving microorganisms.
Nucleic Acids Res.
,
27
, 4218–4222.Graur,D., Gouy,M. and Duret,L. (1997) Evolutionary afﬁnities of the order Perissodactyla and the phylogenetic status of thesuperordinal taxa Ungulata and Altungulata.
Mol. Phylogenet. Evol.
,
7
, 195–200.Graur,D.,Hide,W.A.andLi,W.H.(1991)Istheguineapigarodent?
Nature
,
351
, 649–652.Grumbach,S. and Tahi,F. (1994) A new challenge for compressionalgorithms: genetic sequences.
J. Info. Process. Manage.
,
30
,875–866.
153