A Kernel Approach to Tractable Bayesian Nonparametrics
Ferenc Huszár
University of Cambridge
Simon Lacoste-Julien
University of Cambridge
Abstract
Inference in popular nonparametric Bayesian models typically relies on sampling or other approximations. This paper presents a general methodology for constructing novel tractable nonparametric Bayesian methods by applying the kernel trick to inference in a parametric Bayesian model. For example, Gaussian process regression can be derived this way from Bayesian linear regression. Despite the success of the Gaussian process framework, the kernel trick is rarely explicitly considered in the Bayesian literature. In this paper, we aim to fill this gap and demonstrate the potential of applying the kernel trick to tractable Bayesian parametric models in a wider context than just regression. As an example, we present an intuitive Bayesian kernel machine for density estimation that is obtained by applying the kernel trick to a Gaussian generative model in feature space.
1 Introduction
The popularity of nonparametric Bayesian methods has steadily risen in machine learning over the past decade. Bayesian inference in almost all current nonparametric models relies on approximations which typically involve Markov chain Monte Carlo or, more recently, variational approximations [1]. There is an ever-growing interest in developing nonparametric Bayesian methods in which inference and prediction can be expressed in closed form and no approximations are needed. Such methods would be quite useful and desired in many applications, but are unfortunately rare at present. Perhaps the only known example is Gaussian process (GP) regression. In GP regression [2], the posterior predictive distribution is a tractable Gaussian with parameters that can be computed from data exactly in polynomial time. The method is widely adopted and also has favourable frequentist asymptotic properties [3]. But what is the secret behind the remarkable algorithmic clarity of GP regression? What makes closed-form computations possible? We argue that the key is that GP regression is also a kernel machine: the method is arrived at by applying the kernel trick in a parametric Bayesian model, namely linear regression (see e.g. Chapter 2 in [2]).

Kernel methods [4, 5] use a simple trick, widely known as the kernel trick, to overcome the limitations of a linear model: the observations $x_i \in \mathcal{X}$ are first embedded in a feature space $\mathcal{F}$ using a nonlinear mapping $\varphi : \mathcal{X} \to \mathcal{F}$. A linear algorithm is then applied on the embedded representations $\phi_n = \varphi(x_n)$ instead of the observations themselves. If the algorithm only makes use of scalar products $\langle \phi_n, \phi_m \rangle$, then by replacing all scalar products by tractable kernel evaluations $k(x_n, x_m) := \langle \phi_n, \phi_m \rangle$, the expressive power can be substantially increased with only a minor increase in computational costs. Notably, the kernel trick allows one to construct nonparametric machine learning methods from parametric ones.

Despite its popularity in "non-Bayesian" studies, the kernel trick is rarely considered as a construction tool in Bayesian nonparametrics. GP regression is a rare, if not the only, example. GPs are therefore often called Bayesian kernel machines. In this paper, we consider finding new examples of Bayesian kernel machines, i.e. broadening the intersection between kernel machines and nonparametric Bayesian methods. For Bayesian nonparametrics, the kernel approach offers invaluable closed-form computations and a rigorous analysis framework associated with reproducing kernel Hilbert spaces (RKHS). For kernel machines, a Bayesian formulation offers benefits, such as novel probabilistic approaches to setting kernel hyperparameters and means of incorporating such methods in hierarchical Bayesian models [6, 7, 8].
arXiv:1103.1761v3 [stat.ML] 12 Aug 2011
In this paper, we present a methodology for finding novel Bayesian kernel machines, following the recipe of GP regression:

1. Start with a simple Bayesian model of observations.
2. Derive exact Bayesian inference in the model.
3. Express the posterior predictive distribution in terms of dot products and apply the kernel trick.

The crucial point is finding the basic model in step 1, in which both steps 2 and 3 are possible. Fortunately, this search is guided by intuitive orthonormal invariance considerations that will be demonstrated in this paper. We present an example Bayesian kernel machine for density estimation, based on the linear Gaussian generative model underlying principal component analysis (PCA) [9]. We show that the kernel trick can be applied in the Bayesian method by choosing prior distributions over the parameters which preserve invariance to orthonormal transformations.

The rest of the paper is organised as follows. In Sec. 2, we review Bayesian inference in a Gaussian generative model and discuss consequences of applying the kernel trick to the predictive density. We consider the infinite dimensional feature space case in Subsec. 2.2. In Sec. 3, we review related work. In Sec. 4, we present experiments on high dimensional density estimation problems comparing our method to other Bayesian and non-Bayesian nonparametric methods.
2 A Gaussian model in feature space
Assume that we have observations $x_i$ in a $d$-dimensional Euclidean space and that our task is to estimate their density. In anticipation of the sequel, we embed the observations $x_i$ into a high-dimensional feature space $\mathcal{F}$ with an injective smooth nonlinear mapping $\varphi : \mathcal{X} \to \mathcal{F}$. Our density estimation method is based on a simple generative model on the ambient feature space of the embedded observations $\phi_i = \varphi(x_i)$. For now, think of $\mathcal{F}$ as a $D$-dimensional Euclidean space and $\phi_i$ as arbitrary elements of $\mathcal{F}$ (i.e. they are not necessarily constrained to lie on the observation manifold $\mathcal{O} = \{\varphi(x) : x \in \mathcal{X}\}$). We suppose that the $\phi_i$ were sampled from a Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$:

$$\phi_{1:N} \mid \mu, \Sigma \sim \mathcal{N}(\mu, \Sigma), \text{ i.i.d.} \quad (1)$$

The notation $\phi_{1:N}$ is used to denote the set of vectors $\phi_1$ up to $\phi_N$.

Now we will consider estimating the parameters of this model in a Bayesian way. In Bayesian inference, one defines prior distributions over the parameters and then uses Bayes' rule to compute the posterior over them. Importantly, we also require that the resulting Bayesian procedure is still amenable to the kernel trick. We therefore start by discussing a necessary condition for kernelisation which can then guide our choice of prior distributions.

The kernel trick requires that the algorithm be expressed solely in terms of scalar products $\langle \phi_i, \phi_j \rangle$.
Scalar products are invariant under orthonormal transformations of the space, i.e. if $A$ is an orthonormal transformation then $\langle A u, A v \rangle = \langle u, v \rangle$ for all $u$ and $v$ in the space. Thus, if one wants to express an algorithm in terms of scalar products, it has to be at least invariant under orthonormal transformations, such as rotations, reflections and permutations. It is well known that PCA has this property but, for example, factor analysis (FA) does not [9]; thus one cannot expect a kernel version of FA without any restrictions.

Another desired property of the method is analytical convenience and tractability, which can be ensured by using conjugate priors. The conjugate prior of the Gaussian likelihood is the Normal-inverse-Wishart, which in our case has to be restricted to meet the orthonormal invariance condition:

$$\Sigma \mid \sigma_0^2, \alpha \sim \mathcal{W}^{-1}\!\left(\sigma_0^2 I, \alpha\right), \qquad \mu \mid \Sigma, \beta \sim \mathcal{N}\!\left(0, \tfrac{1}{\beta} \Sigma\right), \quad (2)$$

where $\mathcal{W}^{-1}$ and $\mathcal{N}$ denote the inverse-Wishart and Gaussian distributions with the usual parametrisation. The general Normal-inverse-Wishart family had to be restricted in two ways: firstly, the mean of $\mu$ was set to zero; secondly, the scale matrix of the inverse-Wishart was set to be spherical. These sensible restrictions ensure that the marginal distribution of $\phi$ is centered at the origin and is spherically symmetric; therefore orthonormal invariance holds, which is required for kernelisation.

Having defined the hierarchical generative model in eqns. (1)–(2), our task is to estimate the density of $\phi$'s given the previous observations $\phi_{1:N}$, which in a Bayesian framework is done by calculating the following posterior predictive distribution:
$$p(\phi \mid \phi_{1:N}; \sigma_0^2, \alpha, \beta) = \int p(\phi \mid \mu, \Sigma)\, p(\mu, \Sigma \mid \phi_{1:N}; \sigma_0^2, \alpha, \beta)\, \mathrm{d}\mu\, \mathrm{d}\Sigma$$

By straightforward computation, it can be shown that the posterior predictive distribution is a $D$-dimensional Student-$t$ distribution of the form (with the dependence on hyperparameters made implicit):
$$p(\phi \mid \phi_{1:N}) \propto \left[ \gamma + \tilde{\phi}^T \left( \sigma_0^2 I + \tilde{\Phi} \left( I - \frac{\mathbf{1}\mathbf{1}^T}{N + \beta} \right) \tilde{\Phi}^T \right)^{-1} \tilde{\phi} \right]^{-\frac{1 + N + \alpha}{2}}, \quad (3)$$

where

$$\gamma = \frac{1 + \beta + N}{\beta + N}, \qquad \tilde{\phi} = \phi - \frac{N \bar{\phi}}{N + \beta}, \qquad \tilde{\Phi} = \left[ \phi_1 - \bar{\phi}, \ldots, \phi_N - \bar{\phi} \right],$$

$\bar{\phi} = \frac{1}{N} \sum_{n=1}^{N} \phi_n$ is the empirical mean in feature space and $\mathbf{1}$ is an $N \times 1$ vector of ones.

In order to obtain an expression which only contains scalar products, we invoke Woodbury's matrix inversion formula [10]:
$$p(\phi \mid \phi_{1:N}) \propto \left[ \gamma + \frac{\tilde{\phi}^T \tilde{\phi}}{\sigma_0^2} - \tilde{\phi}^T \tilde{\Phi} \left( \sigma_0^4 \left( I + \frac{\mathbf{1}\mathbf{1}^T}{\beta} \right) + \sigma_0^2 \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \tilde{\phi} \right]^{-\frac{1 + N + \alpha}{2}}. \quad (4)$$
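The Woodbury step from (3) to (4) can be checked numerically. The sketch below is our own sanity check, not from the paper: it draws random stand-ins for $\tilde{\phi}$ and $\tilde{\Phi}$ and confirms that the two quadratic forms coincide, using the identity $A^{-1} = I + \mathbf{1}\mathbf{1}^T/\beta$ for $A = I - \mathbf{1}\mathbf{1}^T/(N + \beta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8
beta, sigma2 = 0.1, 0.7           # sigma2 plays the role of sigma_0^2

Phi = rng.normal(size=(D, N))     # stands in for the centred matrix tilde-Phi
phi = rng.normal(size=D)          # stands in for tilde-phi
ones = np.ones((N, 1))
A = np.eye(N) - ones @ ones.T / (N + beta)

# Quadratic form as in Eq. (3): phi^T (sigma0^2 I + Phi A Phi^T)^{-1} phi
lhs = phi @ np.linalg.solve(sigma2 * np.eye(D) + Phi @ A @ Phi.T, phi)

# Same quantity after Woodbury, as in Eq. (4); note A^{-1} = I + 11^T / beta
Ainv = np.eye(N) + ones @ ones.T / beta
M = sigma2**2 * Ainv + sigma2 * (Phi.T @ Phi)
rhs = phi @ phi / sigma2 - phi @ Phi @ np.linalg.solve(M, Phi.T @ phi)

assert np.isclose(lhs, rhs)
```

The point of the rewrite is that (4) only ever inverts an $N \times N$ matrix and touches $\tilde{\Phi}$ through inner products, which is what makes the kernel trick applicable next.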
2.1 Kernel trick
Until now, we assumed that $\phi$ was an arbitrary point in $D$-dimensional space. In reality, however, we only want to assign probabilities (4) to points on the so-called observation manifold, $\mathcal{O} = \{\varphi(x) : x \in \mathcal{X}\}$, i.e. to points that can be realised by mapping an observation $x \in \mathcal{X}$ to feature space, $\phi = \varphi(x)$. Restricted to $\mathcal{O}$, we can actually make use of the kernel trick, assuming as well that $\phi_i = \varphi(x_i)$ for the previous observations. Indeed, Eq. (4) then only depends on $x_{1:N}$ and $x$ through $\tilde{\phi}^T \tilde{\phi}$, $\tilde{\phi}^T \tilde{\Phi}$ and $\tilde{\Phi}^T \tilde{\Phi}$, which can all be expressed in terms of pairwise scalar products $\phi_n^T \phi_m = \langle \varphi(x_n), \varphi(x_m) \rangle = k(x_n, x_m)$ as follows:

$$\tilde{\phi}^T \tilde{\phi} = k(x, x) - \frac{2 \sum_i k(x, x_i)}{N + \beta} + \frac{\sum_{i,j} k(x_i, x_j)}{(N + \beta)^2} \quad (5)$$

$$\left( \tilde{\phi}^T \tilde{\Phi} \right)_n = k(x, x_n) - \frac{\sum_i k(x, x_i)}{N} + \frac{\sum_{i,j} k(x_i, x_j) - N \sum_i k(x_n, x_i)}{N (N + \beta)}$$

$$\left( \tilde{\Phi}^T \tilde{\Phi} \right)_{n,m} = k(x_n, x_m) - \frac{\sum_i k(x_n, x_i) + \sum_i k(x_m, x_i)}{N} + \frac{\sum_{i,j} k(x_i, x_j)}{N^2}$$
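The kernelised expressions in (5) can be cross-checked against the explicit feature-space definitions whenever the feature map is available. The sketch below is our own illustration: it uses a linear kernel (so the features are the inputs themselves) and verifies that all three quantities computed from the Gram matrix alone agree with direct computation from $\tilde{\phi}$ and $\tilde{\Phi}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, beta = 6, 4, 0.5
F = rng.normal(size=(N, D))   # rows are phi_1..phi_N (linear kernel features)
f = rng.normal(size=D)        # test point phi = phi(x)

K = F @ F.T                   # K[n, m] = k(x_n, x_m)
kx = F @ f                    # kx[n]   = k(x, x_n)
kxx = f @ f                   # k(x, x)

# Kernelised quantities from Eq. (5):
tptp = kxx - 2 * kx.sum() / (N + beta) + K.sum() / (N + beta) ** 2
tpTP = kx - kx.sum() / N + (K.sum() - N * K.sum(axis=1)) / (N * (N + beta))
TPTP = K - (K.sum(axis=1)[:, None] + K.sum(axis=1)[None, :]) / N \
       + K.sum() / N**2

# Cross-check against the explicit feature-space definitions:
fbar = F.mean(axis=0)
tphi = f - N * fbar / (N + beta)   # tilde-phi
TPhi = F - fbar                    # rows are phi_n - bar-phi
assert np.isclose(tptp, tphi @ tphi)
assert np.allclose(tpTP, TPhi @ tphi)
assert np.allclose(TPTP, TPhi @ TPhi.T)
```

Note the asymmetry in the centering: $\tilde{\phi}$ subtracts $N\bar{\phi}/(N+\beta)$ while the columns of $\tilde{\Phi}$ subtract the plain mean $\bar{\phi}$, exactly as defined under (3).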
As said previously, we are only interested in points on the (curved) observation manifold. The restriction of the predictive distribution (4) to the observation manifold induces the same density function $q(\varphi(x) \mid x_{1:N}) = p(\varphi(x) \mid \phi_{1:N})$ (which is now unnormalised), but with respect to a $d$-dimensional Lebesgue base measure in the geodesic coordinate system of the manifold. Finally, we map the density $q(\varphi(x) \mid x_{1:N})$ back to the original input space by implicitly inverting the mapping $\varphi$, yielding the (unnormalised) predictive density $q_k(x \mid x_{1:N})$ defined on the input space. By doing so, we need to include a multiplicative Jacobian correction term which relates volume elements in the tangent space of the manifold to volume elements in the input space $\mathcal{X}$:

$$q_k(x \mid x_{1:N}) = q(\varphi(x) \mid x_{1:N}) \cdot \det \left[ \left\langle \frac{\partial \varphi(x)}{\partial x_i}, \frac{\partial \varphi(x)}{\partial x_j} \right\rangle \right]_{i,j = 1, \ldots, d}^{\frac{1}{2}}. \quad (6)$$

This formula can be derived from the standard change-of-variable formula for densities, as we show in Appendix A.

So what exactly did we gain by applying the kernel trick? Despite the fact that the simple density model in the whole feature space is unimodal, in the local coordinate system of the nonlinear observation manifold the predictive density may appear multimodal and hence possess interesting nonlinear features. This is illustrated in Fig. 1. Assume that we observe draws from a complicated distribution which cannot be conveniently modelled with a linear Gaussian model (Fig. 1.A). We map the observations to a two dimensional feature space using the mapping $\varphi_1(x) = x, \varphi_2(x) = x^2$, hoping that the embedded observations are better fitted by a Gaussian, and carry out inference there (Fig. 1.B). Note how all possible observations in the original space get mapped to the observation manifold $\mathcal{O}$, which is now the parabola $\phi_2 = (\phi_1)^2$. The outcome of inference is a posterior predictive distribution in feature space, which takes a Student-$t$ form. To obtain the predictive distribution in the observation space we have to look at the magnitude of the predictive distribution along the observation manifold. This is illustrated by the contour lines in panel B. The restricted predictive distribution is then "pulled back" to observation space and has to be multiplied by the Jacobian term (Fig. 1.C) to yield the final estimate of the density of the observed samples (Fig. 1.D). Remarkably, our method computes the unnormalised density estimate (before the Jacobian term is applied, Fig. 1.C) directly from the data (Fig. 1.A). All intermediate steps, including the inversion of the mapping, are implicit in the problem formulation, and we never have to work with the feature space directly. As the method essentially estimates the density by a Student-$t$ in feature space, we shall call it kernel Student-$t$ density estimation.
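For the toy map of Fig. 1, $\varphi(x) = (x, x^2)$, the Jacobian correction in (6) reduces to the scalar $\sqrt{1 + 4x^2}$ quoted in the figure caption. A minimal check of ours (not from the paper):

```python
import numpy as np

def jacobian_term(x):
    # For phi(x) = (x, x^2): dphi/dx = (1, 2x), so the 1x1 Gram matrix of
    # tangent vectors is <dphi/dx, dphi/dx> = 1 + 4x^2 and the correction
    # in Eq. (6) is its square root.
    dphi = np.array([1.0, 2.0 * x])
    gram = np.array([[dphi @ dphi]])
    return np.sqrt(np.linalg.det(gram))

assert np.isclose(jacobian_term(1.5), np.sqrt(1 + 4 * 1.5**2))
```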
Figure 1: Illustration of kernel Student-$t$ density estimation on a toy problem. A, Fourteen data points $x_{1:14}$ (crosses) drawn from a mixture distribution (solid curve). B, Embedded observations $\phi_{1:14}$ (crosses) in the two dimensional feature space and contours of the predictive density $q(\phi_{15} \mid x_{1:14})$. The observation manifold is a parabola (solid curve), and we are interested in the magnitude of the predictive distribution restricted to this manifold (colouring). C, The restricted posterior predictive distribution pulled back to the real line (solid curve) gets multiplied by the Jacobian term $\sqrt{1 + 4x^2}$ (dotted curve) to give D, the predictive distribution $q_k(x_{15} \mid x_{1:14})$ (solid curve) in observation space. All distributions are scaled so that they normalise to one.
2.2 Inﬁnite dimensional feature spaces
Of course, we constructed the previous toy example so that the simple normal model in this second order polynomial feature space is able to model the bimodal data distribution. Should the data distribution be more complicated, e.g. have three or more modes, the second order features would hardly be enough to pick up all relevant statistical properties. Intuitively, adding more features results in higher potential expressive power, while the Bayesian integration safeguards us from overfitting. In practice, therefore, we will use the method in conjunction with rich, perhaps infinite dimensional feature spaces, which will allow us to relax these parametric restrictions and to apply our method to model a wide class of probability distributions.

However, moving to infinite dimensional feature spaces introduces additional technical difficulties. For one, Lebesgue measures do not exist in these spaces, therefore our derivations based on densities with respect to a Euclidean base measure should be reconsidered. For a fully rigorous description of the method in infinite dimensional spaces, one may consider using Gaussian base measures instead. In this paper, we address two specific issues that arise when the feature space is large or infinite dimensional.

Firstly, because of the Jacobian correction term, the predictive density $q_k$ in input space is still not fully kernelised in (6). We note though that this term is the determinant of a small $d \times d$ matrix of inner products, which can actually be computed efficiently for some kernels. For example, the all-subsets kernel [11] has a feature space with exponential dimension $D = 2^d$, but each inner product of derivatives can be computed in $O(d)$. But what about nonparametric kernels with infinite $D$? Interestingly, we found that shift invariant kernels – for which $k(x, y) = k(x + z, y + z)$ for all $z$ – actually yield a constant Jacobian term (see proof in Appendix B), so it can be safely ignored to obtain a fully kernelised – although unnormalised – predictive density $q_k(x \mid x_{1:N}) = q(\varphi(x) \mid x_{1:N})$. This result is important, as the most commonly used nonparametric kernels, such as the squared exponential or the Laplacian, fall into this category.

The second issue is normalisation: Eq. (6) only expresses the predictive distribution up to a multiplicative constant. Furthermore, the derivations implicitly assumed that $\alpha > D - 1$, where $D$ is the dimensionality of the feature space, otherwise the probability densities involved become improper. This assumption clearly cannot be satisfied when $D$ is infinite. However, one can argue that even if the densities in the feature space are improper, the predictive distributions that we use may still be normalisable when restricted to the observation manifold for $\alpha > d - 1$. Indeed, the two dimensional Student-$t$ is normalisable when restricted to the parabola as in Fig. 1 for all $\alpha > 0$ rather than $\alpha > 1$ (to show this, consider $N = 0$ so that (3) takes the simple form $q(\phi) = (\text{cnst.} + \phi^T \phi)^{-\frac{1 + \alpha}{2}}$). So although distributions in feature space may be improper, the predictive densities may remain well-defined even when nonparametric kernels are used.

Given these observations, we thus decide to formally apply equation (6) for our predictive density estimation algorithm, ignoring the normalisation constant and also the Jacobian term when shift-invariant nonparametric kernels are applied.
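Putting the pieces together, the unnormalised predictive density can be sketched in a few lines: Eq. (4) evaluated through the kernelised quantities of Eq. (5), with a squared exponential kernel so that the Jacobian term is constant and may be dropped. This is our own illustrative sketch; all variable names and hyperparameter values below are our choices, not the paper's.

```python
import numpy as np

def se_kernel(X, Y, lengthscale=1.0):
    # Squared exponential kernel (shift invariant, so the Jacobian term
    # in Eq. (6) is constant and can be dropped).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def log_qk(x, X, alpha=1.0, beta=0.01, sigma2=0.1, lengthscale=1.0):
    """Unnormalised log predictive density q_k(x | x_{1:N}) from Eq. (4),
    with all inner products kernelised as in Eq. (5). Hyperparameter
    defaults are illustrative assumptions."""
    N = X.shape[0]
    K = se_kernel(X, X, lengthscale)
    kx = se_kernel(x[None, :], X, lengthscale)[0]
    kxx = se_kernel(x[None, :], x[None, :], lengthscale)[0, 0]

    # Kernelised tilde quantities (Eq. 5)
    tptp = kxx - 2 * kx.sum() / (N + beta) + K.sum() / (N + beta) ** 2
    tpTP = kx - kx.sum() / N + (K.sum() - N * K.sum(axis=1)) / (N * (N + beta))
    TPTP = K - (K.sum(axis=1)[:, None] + K.sum(axis=1)[None, :]) / N \
           + K.sum() / N**2

    gamma = (1 + beta + N) / (beta + N)
    ones = np.ones((N, 1))
    M = sigma2**2 * (np.eye(N) + ones @ ones.T / beta) + sigma2 * TPTP
    quad = tptp / sigma2 - tpTP @ np.linalg.solve(M, tpTP)
    return -(1 + N + alpha) / 2 * np.log(gamma + quad)

# Toy usage: the density should be higher near the data than far away.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 1))
assert log_qk(np.array([0.0]), X) > log_qk(np.array([10.0]), X)
```

Everything is computed from the $N \times N$ Gram matrix; the feature space never appears explicitly, which is the practical payoff of Subsec. 2.1.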
A: α = 0.01, β = 0.01. B: α = 3, β = 0.01. C: α = 10, β = 0.01.

Figure 2: Fantasy data drawn from a two dimensional kernel Student-$t$ model with SE kernel. The lengthscale $\ell = 1$ (bars below panels) and $\beta = 0.01$ are fixed; $\alpha$ varies between 0.01 and 10 (A to C). Panels show 50 points (blue dots) drawn from the generative model and the predictive density (gray level) of the 51st draw.

Finally, we want to emphasize that the method does not imply, nor does it require, that arbitrary distributions, when embedded in rich Hilbert spaces, will look exactly like Gaussian or Student-$t$ measures. Instead, the method uses the potentially infinitely flexible family of Gaussian measures to approximately model distributions in the feature space, and uses Bayesian integration to address model misspecification via explicit representation of uncertainty. As demonstrated in Fig. 1, the richness and flexibility of the density model stems from projecting the distribution onto a nonlinear observation manifold, allowing us to overcome limitations of Gaussian measures, such as unimodality.
3 Related work
Our method is closely related to kernel principal component analysis [12, kPCA], as both of them essentially use the same underlying generative model (Eq. 1). While our method is based on Bayesian inference, kPCA performs constrained maximum likelihood estimation [13] of the parameters $\mu$ and $\Sigma$. kPCA is a very effective and popular method for nonlinear feature extraction. Its pitfall from a density estimation perspective is that it does not generally induce a sensible probability distribution in observation space. To understand this, consider performing kPCA of order one (i.e. recovering just a single nonlinear component) on the toy data in Fig. 1.A. The solution defines a degenerate Gaussian distribution in feature space which is concentrated along a line, roughly the long axis of the equiprobability ellipses in Fig. 1.B. As such a distribution can intersect the observation manifold at most twice, the predictive distribution in observation space will be a mixture of two delta distributions. This degeneracy can be ruled out by recovering at least as many nonlinear components as the dimensionality of the feature space, but this is clearly not a viable option in infinite or large dimensional feature spaces. The Bayesian approach sidesteps the degeneracy problem by integrating out the mean and covariance, thereby averaging many, possibly degenerate, distributions to obtain a non-degenerate, smooth Student-$t$ distribution in feature space, which results in a smooth density estimate in observation space.

Another particularly interesting related topic is that of characteristic kernels [14]. A positive definite kernel $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$ is called characteristic if the mean element $\mu_\pi = \mathbb{E}_{X \sim \pi}[\varphi(X)]$ uniquely characterises any probability measure $\pi$. From a density estimation perspective, this suggests that estimating a distribution in observation space boils down to estimating the mean in a characteristic kernel space. The kernel moment matching framework [15] tries to exploit characteristic kernels in a very direct way: parameters of an approximating distribution are chosen by minimising the maximum mean discrepancy in feature space. Our method can be interpreted as performing Bayesian estimation of first and second moments in feature space, therefore one may hope that work on characteristic kernels will help us further study and understand properties of the algorithm.

The Gaussian process has been used as a nonparametric prior over functions for unsupervised learning tasks in several studies, such as in Gaussian process latent variable models [16, GPLVMs] or the Gaussian process density sampler [8, GPDS]. What sets the present approach apart from these is that while they incorporated Gaussian processes as a building block in an un