A Kernel Approach to Tractable Bayesian Nonparametrics

Ferenc Huszár, University of Cambridge
Simon Lacoste-Julien, University of Cambridge

arXiv:1103.1761v3 [stat.ML], 12 Aug 2011

Abstract

Inference in popular nonparametric Bayesian models typically relies on sampling or other approximations. This paper presents a general methodology for constructing novel tractable nonparametric Bayesian methods by applying the kernel trick to inference in a parametric Bayesian model. For example, Gaussian process regression can be derived this way from Bayesian linear regression. Despite the success of the Gaussian process framework, the kernel trick is rarely explicitly considered in the Bayesian literature. In this paper, we aim to fill this gap and demonstrate the potential of applying the kernel trick to tractable Bayesian parametric models in a wider context than just regression. As an example, we present an intuitive Bayesian kernel machine for density estimation that is obtained by applying the kernel trick to a Gaussian generative model in feature space.

1 Introduction

The popularity of nonparametric Bayesian methods has steadily risen in machine learning over the past decade. Bayesian inference in almost all current nonparametric models relies on approximations, which typically involve Markov chain Monte Carlo or, more recently, variational approximations [1]. There is an ever-growing interest in developing nonparametric Bayesian methods in which inference and prediction can be expressed in closed form and no approximations are needed. Such methods would be quite useful and desired in many applications, but are unfortunately rare at present. Perhaps the only known example is Gaussian process (GP) regression. In GP regression [2], the posterior predictive distribution is a tractable Gaussian with parameters that can be computed from data exactly in polynomial time. The method is widely adopted and also has favourable frequentist asymptotic properties [3].
But what is the secret behind the remarkable algorithmic clarity of GP regression? What makes closed-form computations possible? We argue that the key is that GP regression is also a kernel machine: the method is arrived at by applying the kernel trick in a parametric Bayesian model, namely linear regression (see e.g. Chapter 2 in [2]).

Kernel methods [4, 5] use a simple trick, widely known as the kernel trick, to overcome the limitations of a linear model: the observations $x_i \in \mathcal{X}$ are first embedded in a feature space $\mathcal{F}$ using a nonlinear mapping $\varphi : \mathcal{X} \to \mathcal{F}$. A linear algorithm is then applied to the embedded representations $\phi_n = \varphi(x_n)$ instead of the observations themselves. If the algorithm only makes use of scalar products $\langle \phi_n, \phi_m \rangle$, then by replacing all scalar products by tractable kernel evaluations $k(x_n, x_m) := \langle \phi_n, \phi_m \rangle$, the expressive power can be substantially increased with only a minor increase in computational costs. Notably, the kernel trick allows one to construct nonparametric machine learning methods from parametric ones.

Despite its popularity in "non-Bayesian" studies, the kernel trick is rarely considered as a construction tool in Bayesian nonparametrics. GP regression is a rare, if not the only, example. GPs are therefore often called Bayesian kernel machines. In this paper, we consider finding new examples of Bayesian kernel machines, i.e. broadening the intersection between kernel machines and nonparametric Bayesian methods. For Bayesian nonparametrics, the kernel approach offers invaluable closed-form computations and a rigorous analysis framework associated with reproducing kernel Hilbert spaces (RKHS). For kernel machines, a Bayesian formulation offers benefits such as novel probabilistic approaches to setting kernel hyperparameters and means of incorporating such methods in hierarchical Bayesian models [6, 7, 8].
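As a concrete toy illustration of the kernel trick described above, consider computing squared distances in feature space using only kernel evaluations. The quadratic feature map and function names below are our own illustrative choices, not from the paper:

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map for 1-d inputs: phi(x) = (x, x^2)
    return np.array([x, x**2])

def k(x, y):
    # the kernel corresponding to the feature map above:
    # k(x, y) = <phi(x), phi(y)> = x*y + x^2 * y^2
    return x * y + (x**2) * (y**2)

def feature_distance_explicit(x, y):
    # squared distance computed in feature space directly
    return np.sum((phi(x) - phi(y))**2)

def feature_distance_kernelised(x, y):
    # the kernel trick: ||phi(x) - phi(y)||^2 written in scalar products only,
    # so the feature map never has to be evaluated explicitly
    return k(x, x) - 2 * k(x, y) + k(y, y)

x, y = 0.5, -1.2
assert np.isclose(feature_distance_explicit(x, y), feature_distance_kernelised(x, y))
```

Any algorithm expressible in this form can swap in a richer kernel (e.g. one with an infinite-dimensional feature space) without changing the computation.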
In this paper, we present a methodology for finding novel Bayesian kernel machines, following the recipe of GP regression:

1. Start with a simple Bayesian model of observations.
2. Derive exact Bayesian inference in the model.
3. Express the posterior predictive distribution in terms of dot products and apply the kernel trick.

The crucial point is finding the basic model in step 1 in which both steps 2 and 3 are possible. Fortunately, this search is guided by intuitive orthonormal invariance considerations that will be demonstrated in this paper. We present an example Bayesian kernel machine for density estimation, based on the linear Gaussian generative model underlying principal component analysis (PCA) [9]. We show that the kernel trick can be applied in the Bayesian method by choosing prior distributions over the parameters which preserve invariance to orthonormal transformations.

The rest of the paper is organised as follows. In Sec. 2, we review Bayesian inference in a Gaussian generative model and discuss consequences of applying the kernel trick to the predictive density. We consider the infinite-dimensional feature space case in Subsec. 2.2. In Sec. 4, we present experiments on high-dimensional density estimation problems comparing our method to other Bayesian and non-Bayesian nonparametric methods.

2 A Gaussian model in feature space

Assume that we have observations $x_i$ in a $d$-dimensional Euclidean space and that our task is to estimate their density. In anticipation of the sequel, we embed the observations $x_i$ into a high-dimensional feature space $\mathcal{F}$ with an injective smooth nonlinear mapping $\varphi : \mathcal{X} \to \mathcal{F}$. Our density estimation method is based on a simple generative model on the ambient feature space of the embedded observations $\phi_i = \varphi(x_i)$. For now, think of $\mathcal{F}$ as a $D$-dimensional Euclidean space and of $\phi_i$ as arbitrary elements of $\mathcal{F}$ (i.e.
they are not necessarily constrained to lie on the observation manifold $\mathcal{O} = \{\varphi(x) : x \in \mathcal{X}\}$). We suppose that the $\phi_i$ were sampled from a Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$:

$$\phi_{1:N} \mid \mu, \Sigma \sim \mathcal{N}(\mu, \Sigma), \quad \text{i.i.d.} \quad (1)$$

The notation $\phi_{1:N}$ is used to denote the set of vectors $\phi_1$ up to $\phi_N$.

Now we will consider estimating the parameters of this model in a Bayesian way. In Bayesian inference, one defines prior distributions over the parameters and then uses Bayes' rule to compute the posterior over them. Importantly, we now also require that the resulting Bayesian procedure still be amenable to the kernel trick. We therefore start by discussing a necessary condition for kernelisation, which can then guide our choice of prior distributions.

The kernel trick requires that the algorithm be expressed solely in terms of scalar products $\langle \phi_i, \phi_j \rangle$. Scalar products are invariant under orthonormal transformations of the space, i.e. if $A$ is an orthonormal transformation then $\langle u, v \rangle = \langle Au, Av \rangle$ for all $u$ and $v$ in the space. Thus, if one wants to express an algorithm in terms of scalar products, it has to be, at least, invariant under orthonormal transformations, such as rotations, reflections and permutations. It is well known that PCA has this property but, for example, factor analysis (FA) does not [9]; thus one cannot expect a kernel version of FA without any restrictions.

Another desired property of the method is analytical convenience and tractability, which can be ensured by using conjugate priors.
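The invariance of scalar products under orthonormal transformations, which motivates the restrictions below, is easy to check numerically. The following sketch is our own illustration; it draws a random orthonormal matrix via a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random orthonormal transformation A, obtained from the QR decomposition
# of a random Gaussian matrix
D = 4
A, _ = np.linalg.qr(rng.normal(size=(D, D)))
u, v = rng.normal(size=D), rng.normal(size=D)

# orthonormality: A^T A = I ...
assert np.allclose(A.T @ A, np.eye(D))
# ... implies <Au, Av> = u^T A^T A v = <u, v>, i.e. scalar products are preserved
assert np.isclose((A @ u) @ (A @ v), u @ v)
```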
The conjugate prior of the Gaussian likelihood is the Normal-inverse-Wishart, which in our case has to be restricted to meet the orthonormal invariance condition:

$$\Sigma \mid \sigma_0^2, \alpha \sim \mathcal{W}^{-1}\!\left(\sigma_0^2 I, \alpha\right), \qquad \mu \mid \Sigma, \beta \sim \mathcal{N}\!\left(0, \tfrac{1}{\beta}\Sigma\right), \quad (2)$$

where $\mathcal{W}^{-1}$ and $\mathcal{N}$ denote the inverse-Wishart and Gaussian distributions with the usual parametrisation. The general Normal-inverse-Wishart family had to be restricted in two ways: firstly, the mean of $\mu$ was set to zero; secondly, the scale matrix of the inverse-Wishart was set to be spherical. These sensible restrictions ensure that the marginal distribution of $\phi$ is centred at the origin and is spherically symmetric; therefore orthonormal invariance holds, which is required for kernelisation.

Having defined the hierarchical generative model in eqns. (1)–(2), our task is to estimate the density of $\phi$ given the previous observations $\phi_{1:N}$, which in a Bayesian framework is done by calculating the following posterior predictive distribution:

$$p(\phi \mid \phi_{1:N}; \sigma_0^2, \alpha, \beta) = \int p(\phi \mid \mu, \Sigma)\, p(\mu, \Sigma \mid \phi_{1:N}; \sigma_0^2, \alpha, \beta)\, d\mu\, d\Sigma$$

By straightforward computation, it can be shown that the posterior predictive distribution is a $D$-dimensional student-$t$ distribution of the form (with the dependence on hyperparameters made implicit):

$$p(\phi \mid \phi_{1:N}) \propto \left( \gamma + \tilde{\phi}^T \left( \sigma_0^2 I + \tilde{\Phi} \left( I - \frac{\mathbf{1}\mathbf{1}^T}{N+\beta} \right) \tilde{\Phi}^T \right)^{-1} \tilde{\phi} \right)^{-\frac{1+N+\alpha}{2}}, \quad (3)$$

where $\gamma = \frac{1+\beta+N}{\beta+N}$, $\tilde{\phi} = \phi - \frac{N\bar{\phi}}{N+\beta}$, $\tilde{\Phi} = \left[\phi_1 - \bar{\phi}, \ldots, \phi_N - \bar{\phi}\right]$, $\bar{\phi} = \frac{1}{N}\sum_{n=1}^{N} \phi_n$ is the empirical mean in feature space, and $\mathbf{1}$ is an $N \times 1$ vector of ones.

In order to obtain an expression which only contains scalar products, we invoke Woodbury's matrix inversion formula [10]:

$$p(\phi \mid \phi_{1:N}) \propto \left( \gamma + \frac{\tilde{\phi}^T\tilde{\phi}}{\sigma_0^2} - \tilde{\phi}^T \tilde{\Phi} \left( \sigma_0^4 \left( I + \frac{\mathbf{1}\mathbf{1}^T}{\beta} \right) + \sigma_0^2\, \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \tilde{\phi} \right)^{-\frac{1+N+\alpha}{2}}. \quad (4)$$

2.1 Kernel trick

Until now, we assumed that $\phi$ was an arbitrary point in $D$-dimensional space.
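The step from Eq. (3) to Eq. (4) rests on the Woodbury identity together with $(I - \mathbf{1}\mathbf{1}^T/(N+\beta))^{-1} = I + \mathbf{1}\mathbf{1}^T/\beta$. A quick numerical check of this algebra, our own sketch with arbitrary small dimensions, is:

```python
import numpy as np

rng = np.random.default_rng(0)

# small synthetic problem: D-dimensional feature space, N observations
D, N, beta, sigma0_sq = 5, 3, 0.7, 1.3
Phi = rng.normal(size=(D, N))   # stands in for Phi-tilde, the centred feature matrix
phi = rng.normal(size=D)        # stands in for phi-tilde, the centred query point
ones = np.ones((N, 1))

# quadratic form exactly as written in Eq. (3): a D x D inverse
M = np.eye(N) - ones @ ones.T / (N + beta)
lhs = phi @ np.linalg.solve(sigma0_sq * np.eye(D) + Phi @ M @ Phi.T, phi)

# the same quantity after Woodbury, as in Eq. (4): only N x N matrices of
# inner products (Phi^T Phi, Phi^T phi) appear, which is what makes the
# kernel trick applicable
G = Phi.T @ Phi
inner = sigma0_sq**2 * (np.eye(N) + ones @ ones.T / beta) + sigma0_sq * G
rhs = phi @ phi / sigma0_sq - (phi @ Phi) @ np.linalg.solve(inner, Phi.T @ phi)

assert np.isclose(lhs, rhs)
```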
In reality, however, we only want to assign probabilities (4) to points on the so-called observation manifold, $\mathcal{O} = \{\varphi(x) : x \in \mathcal{X}\}$, i.e. points that can be realised by mapping an observation $x \in \mathcal{X}$ to feature space, $\phi = \varphi(x)$. Restricted to $\mathcal{O}$, we can actually make use of the kernel trick, assuming as well that $\phi_i = \varphi(x_i)$ for the previous observations. Indeed, Eq. (4) then only depends on $x_{1:N}$ and $x$ through $\tilde{\phi}^T\tilde{\phi}$, $\tilde{\phi}^T\tilde{\Phi}$ and $\tilde{\Phi}^T\tilde{\Phi}$, which can all be expressed in terms of pairwise scalar products $\phi_n^T\phi_m = \langle \varphi(x_n), \varphi(x_m) \rangle = k(x_n, x_m)$ as follows:

$$\tilde{\phi}^T\tilde{\phi} = k(x,x) - \frac{2\sum_i k(x,x_i)}{N+\beta} + \frac{\sum_{i,j} k(x_i,x_j)}{(N+\beta)^2} \quad (5)$$

$$\left(\tilde{\phi}^T\tilde{\Phi}\right)_n = k(x,x_n) - \frac{\sum_i k(x,x_i)}{N} + \frac{\sum_{i,j} k(x_i,x_j) - N\sum_i k(x_n,x_i)}{N(N+\beta)}$$

$$\left(\tilde{\Phi}^T\tilde{\Phi}\right)_{n,m} = k(x_n,x_m) - \frac{\sum_i k(x_n,x_i) + \sum_i k(x_m,x_i)}{N} + \frac{\sum_{i,j} k(x_i,x_j)}{N^2}$$

As said previously, we are only interested in points on the (curved) observation manifold. The restriction of the predictive distribution (4) to the observation manifold induces the same density function $q(\varphi(x) \mid x_{1:N}) = p(\varphi(x) \mid \phi_{1:N})$ (which is now unnormalised), but with respect to a $d$-dimensional Lebesgue base measure in the geodesic coordinate system of the manifold. Finally, we map the density $q(\varphi(x) \mid x_{1:N})$ back to the original input space by implicitly inverting the mapping $\varphi$, yielding the (unnormalised) predictive density $q_k(x \mid x_{1:N})$ defined on the input space. In doing so, we need to include a multiplicative Jacobian correction term which relates volume elements in the tangent space of the manifold to volume elements in the input space $\mathcal{X}$:

$$q_k(x \mid x_{1:N}) = q(\varphi(x) \mid x_{1:N}) \cdot \det\left[\left\langle \frac{\partial \varphi(x)}{\partial x_i}, \frac{\partial \varphi(x)}{\partial x_j} \right\rangle\right]_{i,j=1,\ldots,d}^{\frac{1}{2}} \quad (6)$$

This formula can be derived from the standard change-of-variable formula for densities, as we show in Appendix A.

So what exactly did we gain by applying the kernel trick?
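The fully kernelised, unnormalised log predictive density of Eqs. (4)–(5) can be sketched in a few lines of Python. This is our own illustrative implementation, not code from the paper; the variable names and hyperparameter defaults are arbitrary choices:

```python
import numpy as np

def kernel_student_t_log_density(x, X, k, alpha=1.0, beta=0.01, sigma0_sq=1.0):
    """Unnormalised log predictive density of Eq. (4), computed purely
    from kernel evaluations as in Eq. (5).
    x: query point; X: (N, d) array of observations; k: kernel function."""
    N = X.shape[0]
    K = np.array([[k(a, b) for b in X] for a in X])   # k(x_i, x_j)
    kx = np.array([k(x, b) for b in X])               # k(x, x_i)
    Ks, kxs = K.sum(), kx.sum()
    rowsum = K.sum(axis=1)                            # sum_i k(x_n, x_i)

    # Eq. (5): centred inner products expressed via the kernel only
    pp = k(x, x) - 2 * kxs / (N + beta) + Ks / (N + beta) ** 2         # phi~^T phi~
    pP = kx - kxs / N + (Ks - N * rowsum) / (N * (N + beta))           # phi~^T Phi~
    PP = K - (rowsum[:, None] + rowsum[None, :]) / N + Ks / N ** 2     # Phi~^T Phi~

    # Eq. (4): quadratic form in its Woodbury (kernelised) form
    gamma = (1 + beta + N) / (beta + N)
    ones = np.ones((N, 1))
    inner = sigma0_sq ** 2 * (np.eye(N) + ones @ ones.T / beta) + sigma0_sq * PP
    quad = pp / sigma0_sq - pP @ np.linalg.solve(inner, pP)
    return -(1 + N + alpha) / 2 * np.log(gamma + quad)

# toy usage with a squared exponential kernel (our choice; shift invariant,
# so the Jacobian term of Eq. (6) is constant and can be dropped)
se = lambda a, b: float(np.exp(-np.sum((a - b) ** 2) / 2))
X = np.array([[0.0], [0.1], [-0.1]])
near = kernel_student_t_log_density(np.array([0.0]), X, se)
far = kernel_student_t_log_density(np.array([5.0]), X, se)
assert near > far   # density mass concentrates around the observed points
```

Note the O(N^3) cost of the linear solve; as with GP regression, this is the price of exact inference.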
Despite the fact that the simple density model in the whole feature space is unimodal, in the local coordinate system of the nonlinear observation manifold the predictive density may appear multimodal and hence possess interesting nonlinear features. This is illustrated in Fig. 1. Assume that we observe draws from a complicated distribution which cannot be conveniently modelled with a linear Gaussian model (Fig. 1.A). We map observations to a two-dimensional feature space using the mapping $\varphi_1(x) = x$, $\varphi_2(x) = x^2$, hoping that the embedded observations are better fitted by a Gaussian, and carry out inference there (Fig. 1.B). Note how all possible observations in the original space get mapped to the observation manifold $\mathcal{O}$, which is now the parabola $\phi_2 = (\phi_1)^2$. The outcome of inference is a posterior predictive distribution in feature space, which takes a student-$t$ form. To obtain the predictive distribution in the observation space we have to look at the magnitude of the predictive distribution along the observation manifold. This is illustrated by the contour lines in panel B. The restricted predictive distribution is then "pulled back" to observation space and multiplied by the Jacobian term (Fig. 1.C) to yield the final estimate of the density of the observed samples (Fig. 1.D). Remarkably, our method computes the unnormalised density estimate (before the Jacobian term is applied, Fig. 1.C) directly from the data (Fig. 1.A). All intermediate steps, including the inversion of the mapping, are implicit in the problem formulation, and we never have to work with the feature space directly. As the method essentially estimates the density by a student-$t$ in feature space, we shall call it kernel student-$t$ density estimation.

Figure 1: Illustration of kernel student-$t$ density estimation on a toy problem. A, Fourteen data points $x_{1:14}$ (crosses) drawn from a mixture distribution (solid curve). B, Embedded observations $\phi_{1:14}$ (crosses) in the two-dimensional feature space and contours of the predictive density $q(\phi_{15} \mid x_{1:14})$. The observation manifold is a parabola (solid curve), and we are interested in the magnitude of the predictive distribution restricted to this manifold (colouring). C, The restricted posterior predictive distribution pulled back to the real line (solid curve) gets multiplied by the Jacobian term $\sqrt{1 + 4x^2}$ (dotted curve) to give D, the predictive distribution $q_k(x_{15} \mid x_{1:14})$ (solid curve) in observation space. All distributions are scaled so that they normalise to one.

2.2 Infinite dimensional feature spaces

Of course, we constructed the previous toy example so that the simple normal model in this second-order polynomial feature space is able to model the bimodal data distribution. Should the data distribution be more complicated, e.g. have three or more modes, the second-order features would hardly be enough to pick up all relevant statistical properties. Intuitively, adding more features results in higher potential expressive power, while the Bayesian integration safeguards us from overfitting. In practice, therefore, we will use the method in conjunction with rich, perhaps infinite-dimensional feature spaces, which will allow us to relax these parametric restrictions and to apply our method to model a wide class of probability distributions.

However, moving to infinite-dimensional feature spaces introduces additional technical difficulties. For one, Lebesgue measures do not exist in these spaces, therefore our derivations based on densities with respect to a Euclidean base measure should be reconsidered. For a fully rigorous description of the method in infinite spaces, one may consider using Gaussian base measures instead.
In this paper, we address two specific issues that arise when the feature space is large or infinite-dimensional.

Firstly, because of the Jacobian correction term, the predictive density $q_k$ in input space is still not fully kernelised in (6). We note, though, that this term is the determinant of a small $d \times d$ matrix of inner products, which can actually be computed efficiently for some kernels. For example, the all-subsets kernel [11] has a feature space with exponential dimension $D = 2^d$, but each inner product of derivatives can be computed in $O(d)$. But what about nonparametric kernels with infinite $D$? Interestingly, we found that shift-invariant kernels, for which $k(x, y) = k(x + z, y + z)$ for all $z$, actually yield a constant Jacobian term (see proof in Appendix B), so it can be safely ignored to obtain a fully kernelised, although unnormalised, predictive density $q_k(x \mid x_{1:N}) = q(\varphi(x) \mid x_{1:N})$. This result is important, as the most commonly used nonparametric kernels, such as the squared exponential or the Laplacian, fall into this category.

The second issue is normalisation: Eq. (6) only expresses the predictive distribution up to a multiplicative constant. Furthermore, the derivations implicitly assumed that $\alpha > D - 1$, where $D$ is the dimensionality of the feature space, as otherwise the probability densities involved become improper. This assumption clearly cannot be satisfied when $D$ is infinite. However, one can argue that even if the densities in the feature space are improper, the predictive distributions that we use may still be normalisable when restricted to the observation manifold for $\alpha > d - 1$. Indeed, the two-dimensional student-$t$ is normalisable when restricted to the parabola as in Fig. 1 for all $\alpha > 0$ rather than $\alpha > 1$ (to show this, consider $N = 0$, so that (3) takes the simple form $q(\phi) = (\text{const.} + \phi^T\phi)^{-\frac{1+\alpha}{2}}$).
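Returning to the first issue: the constant-Jacobian property of shift-invariant kernels can be checked numerically, since $\langle \partial\varphi(x)/\partial x_i, \partial\varphi(x)/\partial x_j \rangle = \partial^2 k(x,y)/\partial x_i \partial y_j$ evaluated at $y = x$. The sketch below is our own; it contrasts the squared exponential with the (non-shift-invariant) kernel of the parabola example:

```python
import numpy as np

def jacobian_gram(k, x, eps=1e-4):
    # G_ij = <d phi(x)/dx_i, d phi(x)/dx_j> = d^2 k(x, y) / (dx_i dy_j) at y = x,
    # estimated by central finite differences of the kernel itself
    d = len(x)
    G = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            G[i, j] = (k(x + ei, x + ej) - k(x + ei, x - ej)
                       - k(x - ei, x + ej) + k(x - ei, x - ej)) / (4 * eps**2)
    return G

# squared exponential kernel (shift invariant): the Gram matrix, and hence the
# Jacobian term det(G)^(1/2) of Eq. (6), is the same at every input point
se = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2)
G1 = jacobian_gram(se, np.array([0.0, 0.0]))
G2 = jacobian_gram(se, np.array([2.5, -1.3]))
assert np.allclose(G1, G2, atol=1e-6)

# the kernel of the parabola feature map, k(x, y) = xy + x^2 y^2, is not shift
# invariant: its Jacobian term sqrt(1 + 4x^2) varies with x (cf. Fig. 1.C)
poly = lambda a, b: a[0] * b[0] + a[0] ** 2 * b[0] ** 2
x = np.array([0.8])
assert np.isclose(np.sqrt(np.linalg.det(jacobian_gram(poly, x))),
                  np.sqrt(1 + 4 * 0.8 ** 2), rtol=1e-4)
```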
So although distributions in feature space may be improper, the predictive densities may remain well-defined even when nonparametric kernels are used.

Given these observations, we thus decide to formally apply equation (6) for our predictive density estimation algorithm, ignoring the normalisation constant, and also the Jacobian term when shift-invariant nonparametric kernels are applied.

Figure 2: Fantasy data drawn from a two-dimensional kernel student-$t$ model with SE kernel. The length-scale $\ell = 1$ (bars below panels) and $\beta = 0.01$ are fixed; $\alpha$ varies between $0.01$ and $10$ from A to C (A: $\alpha = 0.01$; B: $\alpha = 3$; C: $\alpha = 10$). Panels show 50 points (blue dots) drawn from the generative model and the predictive density (gray level) of the 51st draw.

Finally, we want to emphasise that the method neither implies nor requires that arbitrary distributions, when embedded in rich Hilbert spaces, will look exactly like Gaussian or student-$t$ measures. Instead, the method uses the potentially infinitely flexible family of Gaussian measures to approximately model distributions in the feature space, and uses Bayesian integration to address model misspecification via an explicit representation of uncertainty. As demonstrated in Fig. 1, the richness and flexibility of the density model stems from projecting the distribution onto a nonlinear observation manifold, allowing us to overcome limitations of Gaussian measures, such as unimodality.

3 Related work

Our method is closely related to kernel principal component analysis [12, kPCA], as both essentially use the same underlying generative model (Eq. 1). While our method is based on Bayesian inference, kPCA performs constrained maximum likelihood estimation [13] of the parameters $\mu$ and $\Sigma$. kPCA is a very effective and popular method for nonlinear feature extraction.
Its pitfall from a density estimation perspective is that it does not generally induce a sensible probability distribution in observation space. To understand this, consider performing kPCA of order one (i.e. recovering just a single nonlinear component) on the toy data in Fig. 1.A. The solution defines a degenerate Gaussian distribution in feature space which is concentrated along a line, roughly the long axis of the equiprobability ellipses in Fig. 1.B. As such a distribution can intersect the observation manifold at most twice, the predictive distribution in observation space will be a mixture of two delta distributions. This degeneracy can be ruled out by recovering at least as many nonlinear components as the dimensionality of the feature space, but this is clearly not a viable option in infinite or large dimensional feature spaces. The Bayesian approach sidesteps the degeneracy problem by integrating out the mean and covariance, thereby averaging many, possibly degenerate, distributions to obtain a non-degenerate, smooth student-$t$ distribution in feature space, which results in a smooth density estimate in observation space.

Another particularly interesting related topic is that of characteristic kernels [14]. A positive definite kernel $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$ is called characteristic if the mean element $\mu_\pi = \mathbb{E}_{X \sim \pi}[\varphi(X)]$ uniquely characterises any probability measure $\pi$. From a density estimation perspective, this suggests that estimating a distribution in observation space boils down to estimating the mean in a characteristic kernel space. The kernel moment matching framework [15] tries to exploit characteristic kernels in a very direct way: parameters of an approximating distribution are chosen by minimising the maximum mean discrepancy in feature space.
Our method can be interpreted as performing Bayesian estimation of the first and second moments in feature space; therefore one may hope that work on characteristic kernels will help us further study and understand properties of the algorithm.

The Gaussian process has been used as a nonparametric prior over functions for unsupervised learning tasks in several studies, such as in Gaussian process latent variable models [16, GPLVMs] or the Gaussian process density sampler [8, GPDS]. What sets the present approach apart from these is that while they incorporated Gaussian processes as a building block in an un-