arXiv:0805.2368v1 [cs.LG] 15 May 2008

Journal of Machine Learning Research 1 (2008) 1-10    Submitted 04/08; Published 04/08

A Kernel Method for the Two-Sample Problem

Arthur Gretton (arthur@tuebingen.mpg.de)
MPI for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Karsten M. Borgwardt* (kmb51@cam.ac.uk)
University of Cambridge, Department of Engineering, Trumpington Street, CB2 1PZ Cambridge, United Kingdom

Malte J. Rasch (malte@igi.tu-graz.ac.at)
Graz University of Technology, Inffeldgasse 16b/I, 8010 Graz, Austria

Bernhard Schölkopf (bernhard.schoelkopf@tuebingen.mpg.de)
MPI for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Alexander Smola (alex.smola@gmail.com)
National ICT Australia, Canberra, ACT 0200, Australia

Editor: TBA

Abstract

We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g. a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
Keywords: kernel methods, two-sample test, uniform convergence bounds, schema matching, asymptotic analysis, hypothesis testing

* This work was carried out while K.M.B. was with the Ludwig-Maximilians-Universität München.

©2008 Gretton, Borgwardt, Rasch, Schölkopf and Smola.

1. Introduction

We address the problem of comparing samples from two probability distributions, by proposing statistical tests of the hypothesis that these distributions are different (this is called the two-sample or homogeneity problem). Such tests have application in a variety of areas. In bioinformatics, it is of interest to compare microarray data from identical tissue types as measured by different laboratories, to detect whether the data may be analysed jointly, or whether differences in experimental procedure have caused systematic differences in the data distributions. Equally of interest are comparisons between microarray data from different tissue types, either to determine whether two subtypes of cancer may be treated as statistically indistinguishable from a diagnosis perspective, or to detect differences in healthy and cancerous tissue. In database attribute matching, it is desirable to merge databases containing multiple fields, where it is not known in advance which fields correspond: the fields are matched by maximising the similarity in the distributions of their entries.

We test whether distributions p and q are different on the basis of samples drawn from each of them, by finding a well behaved (e.g. smooth) function which is large on the points drawn from p, and small (as negative as possible) on the points from q. We use as our test statistic the difference between the mean function values on the two samples; when this is large, the samples are likely from different distributions.
We call this statistic the Maximum Mean Discrepancy (MMD).

Clearly the quality of the MMD as a statistic depends on the class F of smooth functions that define it. On one hand, F must be "rich enough" so that the population MMD vanishes if and only if p = q. On the other hand, for the test to be consistent, F needs to be "restrictive" enough for the empirical estimate of MMD to converge quickly to its expectation as the sample size increases. We shall use the unit balls in universal reproducing kernel Hilbert spaces (Steinwart, 2001) as our function classes, since these will be shown to satisfy both of the foregoing properties (we also review classical metrics on distributions, namely the Kolmogorov-Smirnov and Earth-Mover's distances, which are based on different function classes). On a more practical note, the MMD has a reasonable computational cost, when compared with other two-sample tests: given m points sampled from p and n from q, the cost is O((m + n)²) time. We also propose a less statistically efficient algorithm with a computational cost of O(m + n), which can yield superior performance at a given computational cost by looking at a larger volume of data.

We define three non-parametric statistical tests based on the MMD. The first two, which use distribution-independent uniform convergence bounds, provide finite sample guarantees of test performance, at the expense of being conservative in detecting differences between p and q. The third test is based on the asymptotic distribution of the MMD, and is in practice more sensitive to differences in distribution at small sample sizes. The present work synthesizes and expands on results of Gretton et al. (2007a,b), Smola et al. (2007), and Song et al. (2008),¹ who in turn build on the earlier work of Borgwardt et al. (2006).
Note that the latter addresses only the third kind of test, and that the approach of Gretton et al. (2007a,b) employs a more accurate approximation to the asymptotic distribution of the test statistic.

1. In particular, most of the proofs here were not provided by Gretton et al. (2007a).

We begin our presentation in Section 2 with a formal definition of the MMD, and a proof that the population MMD is zero if and only if p = q when F is the unit ball of a universal RKHS. We also review alternative function classes for which the MMD defines a metric on probability distributions. In Section 3, we give an overview of hypothesis testing as it applies to the two-sample problem, and review other approaches to this problem. We present our first two hypothesis tests in Section 4, based on two different bounds on the deviation between the population and empirical MMD. We take a different approach in Section 5, where we use the asymptotic distribution of the empirical MMD estimate as the basis for a third test. When large volumes of data are available, the cost of computing the MMD (quadratic in the sample size) may be excessive: we therefore propose in Section 6 a modified version of the MMD statistic that has a linear cost in the number of samples, and an associated asymptotic test. In Section 7, we provide an overview of methods related to the MMD in the statistics and machine learning literature. Finally, in Section 8, we demonstrate the performance of MMD-based two-sample tests on problems from neuroscience, bioinformatics, and attribute matching using the Hungarian marriage method. Our approach performs well on high dimensional data with low sample size; in addition, we are able to successfully distinguish distributions on graph data, for which ours is the first proposed test.
2. The Maximum Mean Discrepancy

In this section, we present the maximum mean discrepancy (MMD), and describe conditions under which it is a metric on the space of probability distributions. The MMD is defined in terms of particular function spaces that witness the difference in distributions: we therefore begin in Section 2.1 by introducing the MMD for some arbitrary function space. In Section 2.2, we compute both the population MMD and two empirical estimates when the associated function space is a reproducing kernel Hilbert space, and we derive the RKHS function that witnesses the MMD for a given pair of distributions in Section 2.3. Finally, we describe the MMD for more general function classes in Section 2.4.

2.1 Definition of the Maximum Mean Discrepancy

Our goal is to formulate a statistical test that answers the following question:

Problem 1  Let p and q be Borel probability measures defined on a domain X. Given observations X := {x_1, ..., x_m} and Y := {y_1, ..., y_n}, drawn independently and identically distributed (i.i.d.) from p and q, respectively, can we decide whether p ≠ q?

To start with, we wish to determine a criterion that, in the population setting, takes on a unique and distinctive value only when p = q. It will be defined based on Lemma 9.3.2 of Dudley (2002).

Lemma 1  Let (X, d) be a metric space, and let p, q be two Borel probability measures defined on X. Then p = q if and only if E_{x∼p}[f(x)] = E_{y∼q}[f(y)] for all f ∈ C(X), where C(X) is the space of bounded continuous functions on X.

Although C(X) in principle allows us to identify p = q uniquely, it is not practical to work with such a rich function class in the finite sample setting. We thus define a more general class of statistic, for as yet unspecified function classes F, to measure the disparity between p and q (Fortet and Mourier, 1953; Müller, 1997).
Definition 2  Let F be a class of functions f : X → R, and let p, q, X, Y be defined as above. We define the maximum mean discrepancy (MMD) as

    MMD[F, p, q] := sup_{f∈F} ( E_{x∼p}[f(x)] − E_{y∼q}[f(y)] ).   (1)

Müller (1997) calls this an integral probability metric. A biased empirical estimate of the MMD is

    MMD_b[F, X, Y] := sup_{f∈F} ( (1/m) Σ_{i=1}^m f(x_i) − (1/n) Σ_{i=1}^n f(y_i) ).   (2)

The empirical MMD defined above has an upward bias (we will define an unbiased statistic in the following section). We must now identify a function class that is rich enough to uniquely identify whether p = q, yet restrictive enough to provide useful finite sample estimates (the latter property will be established in subsequent sections).

2.2 The MMD in Reproducing Kernel Hilbert Spaces

If F is the unit ball in a reproducing kernel Hilbert space H, the empirical MMD can be computed very efficiently. This will be the main approach we pursue in the present study. Other possible function classes F are discussed at the end of this section. We will refer to H as universal whenever H, defined on a compact metric space X and with associated kernel k : X² → R, is dense in C(X) with respect to the L∞ norm. It is shown in Steinwart (2001) that Gaussian and Laplace kernels are universal. We have the following result:

Theorem 3  Let F be a unit ball in a universal RKHS H, defined on the compact metric space X, with associated kernel k(·, ·). Then MMD[F, p, q] = 0 if and only if p = q.

Proof  It is clear that MMD[F, p, q] is zero if p = q. We prove the converse by showing that MMD[C(X), p, q] = D for some D > 0 implies MMD[F, p, q] > 0: this is equivalent to MMD[F, p, q] = 0 implying MMD[C(X), p, q] = 0 (where this last result implies p = q by Lemma 1, noting that compactness of the metric space X implies its separability). Let H be the universal RKHS of which F is the unit ball.
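Definition 2 is easy to probe numerically. The sketch below is a minimal illustration, not from the paper: it approximates the supremum in Eq. (2) over a small hand-picked set of bounded smooth functions (rather than a full unit ball, which the paper treats in Section 2.2), and the particular candidate functions, distributions, and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=200)  # sample from p = N(0, 1)
Y = rng.normal(2.0, 1.0, size=200)  # sample from q = N(2, 1)

# A tiny stand-in for the function class F of Eq. (2): bounded, smooth,
# hand-picked candidates (the paper instead uses an RKHS unit ball).
candidates = [
    np.tanh,
    lambda t: -np.tanh(t),  # large on points from p, negative on points from q
    np.sin,
    lambda t: np.exp(-t ** 2),
]

def mmd_b(X, Y, funcs):
    """Biased empirical MMD of Eq. (2), with the sup restricted to `funcs`."""
    return max(f(X).mean() - f(Y).mean() for f in funcs)

print(mmd_b(X, X, candidates))  # 0: identical samples
print(mmd_b(X, Y, candidates))  # positive: samples likely from different distributions
```

Even this crude surrogate shows the two behaviours the text asks of a good statistic: it vanishes when the samples coincide, and some candidate (here, −tanh) is large on the points from p and negative on those from q, driving the statistic up when p ≠ q.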
If MMD[C(X), p, q] = D, then there exists some f̃ ∈ C(X) for which E_p[f̃] − E_q[f̃] ≥ D/2. We know that H is dense in C(X) with respect to the L∞ norm: this means that for ε = D/8, we can find some f* ∈ H satisfying ‖f* − f̃‖_∞ < ε. Thus, we obtain |E_p[f*] − E_p[f̃]| < ε and consequently

    |E_p[f*] − E_q[f*]| > |E_p[f̃] − E_q[f̃]| − 2ε > D/2 − 2D/8 = D/4 > 0.

Finally, using ‖f*‖_H < ∞, we have

    [E_p[f*] − E_q[f*]] / ‖f*‖_H ≥ D / (4‖f*‖_H) > 0,

and hence MMD[F, p, q] > 0.

We now review some properties of H that will allow us to express the MMD in a more easily computable form (Schölkopf and Smola, 2002). Since H is an RKHS, the operator of evaluation δ_x mapping f ∈ H to f(x) ∈ R is continuous. Thus, by the Riesz representation theorem, there is a feature mapping φ(x) from X to H such that f(x) = ⟨f, φ(x)⟩_H. Moreover, ⟨φ(x), φ(y)⟩_H = k(x, y), where k(x, y) is a positive definite kernel function. The following lemma is due to Borgwardt et al. (2006).

Lemma 4  Denote the expectation of φ(x) by μ_p := E_p[φ(x)] (assuming its existence).² Then

    MMD[F, p, q] = sup_{‖f‖_H ≤ 1} ⟨μ_p − μ_q, f⟩ = ‖μ_p − μ_q‖_H.   (3)

Proof

    MMD²[F, p, q] = [ sup_{‖f‖_H ≤ 1} ( E_p[f(x)] − E_q[f(y)] ) ]²
                  = [ sup_{‖f‖_H ≤ 1} ( E_p[⟨φ(x), f⟩_H] − E_q[⟨φ(y), f⟩_H] ) ]²
                  = [ sup_{‖f‖_H ≤ 1} ⟨μ_p − μ_q, f⟩_H ]²
                  = ‖μ_p − μ_q‖²_H.

Given we are in an RKHS, the norm ‖μ_p − μ_q‖²_H may easily be computed in terms of kernel functions. This leads to a first empirical estimate of the MMD, which is unbiased.
Lemma 5  Given x and x′ independent random variables with distribution p, and y and y′ independent random variables with distribution q, the population MMD² is

    MMD²[F, p, q] = E_{x,x′∼p}[k(x, x′)] − 2 E_{x∼p, y∼q}[k(x, y)] + E_{y,y′∼q}[k(y, y′)].   (4)

Let Z := (z_1, ..., z_m) be m i.i.d. random variables, where z_i := (x_i, y_i) (i.e. we assume m = n). An unbiased empirical estimate of MMD² is

    MMD²_u[F, X, Y] = 1/(m(m−1)) Σ_{i≠j} h(z_i, z_j),   (5)

which is a one-sample U-statistic with h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) − k(x_i, y_j) − k(x_j, y_i) (we define h(z_i, z_j) to be symmetric in its arguments due to requirements that will arise in Section 5).

Proof  Starting from the expression for MMD²[F, p, q] in Lemma 4,

    MMD²[F, p, q] = ‖μ_p − μ_q‖²_H
                  = ⟨μ_p, μ_p⟩_H + ⟨μ_q, μ_q⟩_H − 2⟨μ_p, μ_q⟩_H
                  = E_p[⟨φ(x), φ(x′)⟩_H] + E_q[⟨φ(y), φ(y′)⟩_H] − 2 E_{p,q}[⟨φ(x), φ(y)⟩_H]
                  = E_{x,x′∼p}[k(x, x′)] + E_{y,y′∼q}[k(y, y′)] − 2 E_{x∼p, y∼q}[k(x, y)],

where the last step uses ⟨φ(x), φ(y)⟩_H = k(x, y), giving Eq. (4).

2. A sufficient condition for this is ‖μ_p‖²_H < ∞, which is rearranged as E_p[k(x, x′)] < ∞, where x and x′ are independent random variables drawn according to p. In other words, k is a trace class operator with respect to the measure p.
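Lemma 5 is what makes the RKHS choice practical: Eq. (5) needs only the three kernel matrices on the samples. The sketch below is our own minimal illustration, not the authors' code; the Gaussian kernel and its bandwidth, and the choice of test distributions, are arbitrary assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of a and b."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2_u(X, Y, sigma=1.0):
    """Unbiased MMD^2 estimate of Eq. (5); assumes m = n samples (quadratic time)."""
    m = len(X)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Sum h(z_i, z_j) over i != j: excluding the diagonal terms is exactly
    # what removes the upward bias of the estimate in Eq. (2).
    off = ~np.eye(m, dtype=bool)
    return (Kxx[off].sum() + Kyy[off].sum()
            - Kxy[off].sum() - Kxy.T[off].sum()) / (m * (m - 1))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(1.0, 1.0, size=(500, 2))
print(mmd2_u(X, Y))  # clearly positive: the two distributions differ
print(mmd2_u(X, rng.normal(0.0, 1.0, size=(500, 2))))  # near zero: same distribution
```

Note that, being unbiased, MMD²_u can come out slightly negative when p = q; this is expected, and the asymptotic tests of Section 5 account for the null fluctuations of the statistic.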