Journal of Machine Learning Research 1 (2008) 110 Submitted 04/08; Published 04/08
A Kernel Method for the Two-Sample Problem
Arthur Gretton  arthur@tuebingen.mpg.de
MPI for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Karsten M. Borgwardt*  kmb51@cam.ac.uk
University of Cambridge, Department of Engineering, Trumpington Street, CB2 1PZ Cambridge, United Kingdom

Malte J. Rasch  malte@igi.tugraz.ac.at
Graz University of Technology, Inffeldgasse 16b/I, 8010 Graz, Austria

Bernhard Schölkopf  bernhard.schoelkopf@tuebingen.mpg.de
MPI for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Alexander Smola  alex.smola@gmail.com
National ICT Australia, Canberra, ACT 0200, Australia

Editor: TBA
Abstract
We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g. a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
Keywords: kernel methods, two-sample test, uniform convergence bounds, schema matching, asymptotic analysis, hypothesis testing
*. This work was carried out while K.M.B. was with the Ludwig-Maximilians-Universität München.
©2008 Gretton, Borgwardt, Rasch, Schölkopf and Smola.
1. Introduction
We address the problem of comparing samples from two probability distributions, by proposing statistical tests of the hypothesis that these distributions are different (this is called the two-sample or homogeneity problem). Such tests have application in a variety of areas. In bioinformatics, it is of interest to compare microarray data from identical tissue types as measured by different laboratories, to detect whether the data may be analysed jointly, or whether differences in experimental procedure have caused systematic differences in the data distributions. Equally of interest are comparisons between microarray data from different tissue types, either to determine whether two subtypes of cancer may be treated as statistically indistinguishable from a diagnosis perspective, or to detect differences in healthy and cancerous tissue. In database attribute matching, it is desirable to merge databases containing multiple fields, where it is not known in advance which fields correspond: the fields are matched by maximising the similarity in the distributions of their entries.

We test whether distributions p and q are different on the basis of samples drawn from each of them, by finding a well behaved (e.g. smooth) function which is large on the points drawn from p, and small (as negative as possible) on the points from q. We use as our test statistic the difference between the mean function values on the two samples; when this is large, the samples are likely from different distributions. We call this statistic the Maximum Mean Discrepancy (MMD).

Clearly the quality of the MMD as a statistic depends on the class F of smooth functions that define it. On one hand, F must be "rich enough" so that the population MMD vanishes if and only if p = q. On the other hand, for the test to be consistent, F needs to be "restrictive" enough for the empirical estimate of MMD to converge quickly to its expectation as the sample size increases. We shall use the unit balls in universal reproducing kernel Hilbert spaces (Steinwart, 2001) as our function classes, since these will be shown to satisfy both of the foregoing properties (we also review classical metrics on distributions, namely the Kolmogorov-Smirnov and Earth-Mover's distances, which are based on different function classes). On a more practical note, the MMD has a reasonable computational cost, when compared with other two-sample tests: given m points sampled from p and n from q, the cost is O((m + n)^2) time. We also propose a less statistically efficient algorithm with a computational cost of O(m + n), which can yield superior performance at a given computational cost by looking at a larger volume of data.

We define three nonparametric statistical tests based on the MMD. The first two, which use distribution-independent uniform convergence bounds, provide finite sample guarantees of test performance, at the expense of being conservative in detecting differences between p and q. The third test is based on the asymptotic distribution of the MMD, and is in practice more sensitive to differences in distribution at small sample sizes. The present work synthesizes and expands on the results of Gretton et al. (2007a,b), Smola et al. (2007), and Song et al. (2008),¹ who in turn build on the earlier work of Borgwardt et al. (2006). Note that the latter addresses only the third kind of test, and that the approach of Gretton et al. (2007a,b) employs a more accurate approximation to the asymptotic distribution of the test statistic.
1. In particular, most of the proofs here were not provided by Gretton et al. (2007a).
We begin our presentation in Section 2 with a formal definition of the MMD, and a proof that the population MMD is zero if and only if p = q when F is the unit ball of a universal RKHS. We also review alternative function classes for which the MMD defines a metric on probability distributions. In Section 3, we give an overview of hypothesis testing as it applies to the two-sample problem, and review other approaches to this problem. We present our first two hypothesis tests in Section 4, based on two different bounds on the deviation between the population and empirical MMD. We take a different approach in Section 5, where we use the asymptotic distribution of the empirical MMD estimate as the basis for a third test. When large volumes of data are available, the cost of computing the MMD (quadratic in the sample size) may be excessive: we therefore propose in Section 6 a modified version of the MMD statistic that has a linear cost in the number of samples, and an associated asymptotic test. In Section 7, we provide an overview of methods related to the MMD in the statistics and machine learning literature. Finally, in Section 8, we demonstrate the performance of MMD-based two-sample tests on problems from neuroscience, bioinformatics, and attribute matching using the Hungarian marriage method. Our approach performs well on high dimensional data with low sample size; in addition, we are able to successfully distinguish distributions on graph data, for which ours is the first proposed test.
2. The Maximum Mean Discrepancy
In this section, we present the maximum mean discrepancy (MMD), and describe conditions under which it is a metric on the space of probability distributions. The MMD is defined in terms of particular function spaces that witness the difference in distributions: we therefore begin in Section 2.1 by introducing the MMD for some arbitrary function space. In Section 2.2, we compute both the population MMD and two empirical estimates when the associated function space is a reproducing kernel Hilbert space, and we derive the RKHS function that witnesses the MMD for a given pair of distributions in Section 2.3. Finally, we describe the MMD for more general function classes in Section 2.4.
2.1 Deﬁnition of the Maximum Mean Discrepancy
Our goal is to formulate a statistical test that answers the following question:
Problem 1  Let p and q be Borel probability measures defined on a domain 𝒳. Given observations X := {x_1, ..., x_m} and Y := {y_1, ..., y_n}, drawn independently and identically distributed (i.i.d.) from p and q, respectively, can we decide whether p = q?

To start with, we wish to determine a criterion that, in the population setting, takes on a unique and distinctive value only when p = q. It will be defined based on Lemma 9.3.2 of Dudley (2002).
Lemma 1  Let (𝒳, d) be a metric space, and let p, q be two Borel probability measures defined on 𝒳. Then p = q if and only if E_{x∼p}[f(x)] = E_{y∼q}[f(y)] for all f ∈ C(𝒳), where C(𝒳) is the space of bounded continuous functions on 𝒳.
Although C(𝒳) in principle allows us to identify p = q uniquely, it is not practical to work with such a rich function class in the finite sample setting. We thus define a more general class of statistics, for as yet unspecified function classes F, to measure the disparity between p and q (Fortet and Mourier, 1953; Müller, 1997).
Definition 2  Let F be a class of functions f : 𝒳 → R and let p, q, X, Y be defined as above. We define the maximum mean discrepancy (MMD) as

    MMD[F, p, q] := sup_{f∈F} ( E_{x∼p}[f(x)] − E_{y∼q}[f(y)] ).    (1)
Müller (1997) calls this an integral probability metric. A biased empirical estimate of the MMD is

    MMD_b[F, X, Y] := sup_{f∈F} ( (1/m) Σ_{i=1}^{m} f(x_i) − (1/n) Σ_{i=1}^{n} f(y_i) ).    (2)

The empirical MMD defined above has an upward bias (we will define an unbiased statistic in the following section). We must now identify a function class that is rich enough to uniquely identify whether p = q, yet restrictive enough to provide useful finite sample estimates (the latter property will be established in subsequent sections).
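To make Eq. (2) concrete before any particular F is chosen, here is a toy instantiation (not from the paper): F is a small, hand-picked finite class of bounded functions, and the biased statistic is simply the largest difference in empirical means over that class. The sample sizes, distributions, and functions below are arbitrary choices for this sketch.

```python
import math
import random

random.seed(0)

# Toy samples: m points from p = N(0, 1) and n points from q = N(1, 1).
m, n = 400, 400
X = [random.gauss(0.0, 1.0) for _ in range(m)]
Y = [random.gauss(1.0, 1.0) for _ in range(n)]

# A hand-picked finite function class F; each f is bounded and smooth.
F = [
    lambda t: math.tanh(t),       # monotone, sensitive to mean shifts
    lambda t: math.tanh(-t),      # its negation, so the sup covers both signs
    lambda t: math.exp(-t * t),   # bump at 0, sensitive to mass near the origin
]

# Eq. (2): MMD_b[F, X, Y] = sup over f in F of (mean of f on X) - (mean of f on Y).
mmd_b = max(sum(f(x) for x in X) / m - sum(f(y) for y in Y) / n for f in F)
print(mmd_b)  # positive here, since p and q clearly differ
```

With such a small F the statistic can only detect differences the chosen functions happen to witness; the RKHS unit ball used in the next section removes this limitation while keeping the sup computable.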
2.2 The MMD in Reproducing Kernel Hilbert Spaces
If F is the unit ball in a reproducing kernel Hilbert space H, the empirical MMD can be computed very efficiently. This will be the main approach we pursue in the present study. Other possible function classes F are discussed at the end of this section. We will refer to H as universal whenever H, defined on a compact metric space 𝒳 and with associated kernel k : 𝒳² → R, is dense in C(𝒳) with respect to the L∞ norm. It is shown in Steinwart (2001) that Gaussian and Laplace kernels are universal. We have the following result:
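For reference, the two universal kernels just mentioned take a simple closed form; a minimal sketch (the bandwidth parameter sigma and the L1 parameterization of the Laplace kernel are illustrative choices, not prescriptions from the paper):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), for x, y given as tuples.
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def laplace_kernel(x, y, sigma=1.0):
    # One common parameterization: k(x, y) = exp(-||x - y||_1 / sigma).
    l1 = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-l1 / sigma)

print(gaussian_kernel((0.0,), (0.0,)))  # 1.0: both kernels satisfy k(x, x) = 1
```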
Theorem 3  Let F be a unit ball in a universal RKHS H, defined on the compact metric space 𝒳, with associated kernel k(·, ·). Then MMD[F, p, q] = 0 if and only if p = q.
Proof  It is clear that MMD[F, p, q] is zero if p = q. We prove the converse by showing that MMD[C(𝒳), p, q] = D for some D > 0 implies MMD[F, p, q] > 0: this is equivalent to MMD[F, p, q] = 0 implying MMD[C(𝒳), p, q] = 0 (where this last result implies p = q by Lemma 1, noting that compactness of the metric space 𝒳 implies its separability). Let H be the universal RKHS of which F is the unit ball. If MMD[C(𝒳), p, q] = D, then there exists some f̃ ∈ C(𝒳) for which E_p f̃ − E_q f̃ ≥ D/2. We know that H is dense in C(𝒳) with respect to the L∞ norm: this means that for ε = D/8, we can find some f* ∈ H satisfying ‖f* − f̃‖_∞ < ε. Thus, we obtain |E_p[f*] − E_p f̃| < ε and consequently

    |E_p[f*] − E_q[f*]| > E_p f̃ − E_q f̃ − 2ε > D/2 − 2(D/8) = D/4 > 0.

Finally, using ‖f*‖_H < ∞, we have

    [E_p[f*] − E_q[f*]] / ‖f*‖_H ≥ D / (4‖f*‖_H) > 0,

and hence MMD[F, p, q] > 0.
We now review some properties of H that will allow us to express the MMD in a more easily computable form (Schölkopf and Smola, 2002). Since H is an RKHS, the operator of evaluation δ_x mapping f ∈ H to f(x) ∈ R is continuous. Thus, by the Riesz representation theorem, there is a feature mapping φ(x) from 𝒳 to H such that f(x) = ⟨f, φ(x)⟩_H. Moreover, ⟨φ(x), φ(y)⟩_H = k(x, y), where k(x, y) is a positive definite kernel function. The following lemma is due to Borgwardt et al. (2006).
Lemma 4  Denote the expectation of φ(x) by µ_p := E_p[φ(x)] (assuming its existence).² Then

    MMD[F, p, q] = sup_{‖f‖_H ≤ 1} ⟨µ_p − µ_q, f⟩ = ‖µ_p − µ_q‖_H.    (3)
Proof

    MMD²[F, p, q] = [ sup_{‖f‖_H ≤ 1} ( E_p[f(x)] − E_q[f(y)] ) ]²
                  = [ sup_{‖f‖_H ≤ 1} ( E_p[⟨φ(x), f⟩_H] − E_q[⟨φ(y), f⟩_H] ) ]²
                  = [ sup_{‖f‖_H ≤ 1} ⟨µ_p − µ_q, f⟩_H ]²
                  = ‖µ_p − µ_q‖²_H.
Given we are in an RKHS, the norm ‖µ_p − µ_q‖²_H may easily be computed in terms of kernel functions. This leads to a first empirical estimate of the MMD, which is unbiased.
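As an illustration of why the norm is easy to compute, replacing µ_p and µ_q by the empirical means of φ expands ‖µ̂_p − µ̂_q‖²_H into nothing but pairwise kernel evaluations. This plug-in version is the biased estimate (the i = j terms are kept); Lemma 5 below gives the unbiased correction. A sketch, where the Gaussian kernel, bandwidth, and samples are arbitrary choices:

```python
import math
import random

def k(x, y, sigma=1.0):
    # Gaussian kernel with an arbitrary bandwidth sigma = 1 for this sketch.
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def mmd2_biased(X, Y):
    # ||mu_hat_p - mu_hat_q||^2_H expanded into pairwise kernel sums
    # (biased: the "diagonal" i = j terms are included in each sum).
    m, n = len(X), len(Y)
    kxx = sum(k(a, b) for a in X for b in X) / (m * m)
    kyy = sum(k(a, b) for a in Y for b in Y) / (n * n)
    kxy = sum(k(a, b) for a in X for b in Y) / (m * n)
    return kxx + kyy - 2.0 * kxy

random.seed(1)
X = [random.gauss(0.0, 1.0) for _ in range(200)]
Y = [random.gauss(0.5, 1.0) for _ in range(200)]
print(mmd2_biased(X, Y))  # positive: a squared norm, larger when samples differ
print(mmd2_biased(X, X))  # exactly zero when the two samples coincide
```

Being a squared RKHS norm of empirical embeddings, this quantity is always nonnegative, which is one way to see its upward bias relative to the population MMD².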
Lemma 5  Given x and x′ independent random variables with distribution p, and y and y′ independent random variables with distribution q, the population MMD² is

    MMD²[F, p, q] = E_{x,x′∼p}[k(x, x′)] − 2 E_{x∼p, y∼q}[k(x, y)] + E_{y,y′∼q}[k(y, y′)].    (4)

Let Z := (z_1, ..., z_m) be m i.i.d. random variables, where z_i := (x_i, y_i) (i.e. we assume m = n). An unbiased empirical estimate of MMD² is

    MMD²_u[F, X, Y] = (1 / (m(m − 1))) Σ_{i≠j} h(z_i, z_j),    (5)

which is a one-sample U-statistic with h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) − k(x_i, y_j) − k(x_j, y_i) (we define h(z_i, z_j) to be symmetric in its arguments due to requirements that will arise in Section 5).
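Equations (4)-(5) translate directly into code. A sketch of the unbiased U-statistic, assuming m = n as in the lemma (the Gaussian kernel, bandwidth, and toy samples are illustrative choices):

```python
import math
import random

def k(x, y, sigma=1.0):
    # Gaussian RBF kernel; an illustrative choice of universal kernel.
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def h(zi, zj):
    # h(z_i, z_j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i),
    # the symmetric kernel of the U-statistic in Eq. (5).
    (xi, yi), (xj, yj) = zi, zj
    return k(xi, xj) + k(yi, yj) - k(xi, yj) - k(xj, yi)

def mmd2_u(X, Y):
    # One-sample U-statistic of Eq. (5): average h over all pairs i != j.
    assert len(X) == len(Y), "the lemma pairs z_i := (x_i, y_i), so m = n"
    Z = list(zip(X, Y))
    m = len(Z)
    total = sum(h(Z[i], Z[j]) for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))

random.seed(2)
X = [random.gauss(0.0, 1.0) for _ in range(100)]
Y = [random.gauss(0.0, 1.0) for _ in range(100)]
print(mmd2_u(X, Y))  # near zero here since p = q; unlike the biased
                     # estimate, the unbiased one can be slightly negative
```

The double loop makes the quadratic O((m + n)^2) cost of the statistic explicit, which is what motivates the linear-time variant of Section 6.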
Proof  Starting from the expression for MMD²[F, p, q] in Lemma 4,

    MMD²[F, p, q] = ‖µ_p − µ_q‖²_H
                  = ⟨µ_p, µ_p⟩_H + ⟨µ_q, µ_q⟩_H − 2⟨µ_p, µ_q⟩_H
                  = E_p[⟨φ(x), φ(x′)⟩_H] + E_q[⟨φ(y), φ(y′)⟩_H] − 2 E_{p,q}[⟨φ(x), φ(y)⟩_H],
2. A sufficient condition for this is ‖µ_p‖²_H < ∞, which is rearranged as E_p[k(x, x′)] < ∞, where x and x′ are independent random variables drawn according to p. In other words, k is a trace class operator with respect to the measure p.