A K-nearest neighbours method based on lower previsions
Sebastien Destercke
INRA/CIRAD, UMR1208, 2 place P. Viala, F-34060 Montpellier cedex 1, France
sebastien.destercke@supagro.inra.fr
Abstract. K-nearest neighbours algorithms are among the most popular existing classification methods, due to their simplicity and good performance. Over the years, several extensions of the initial method have been proposed. In this paper, we propose a K-nearest neighbours approach that uses the theory of imprecise probabilities, and more specifically lower previsions. This approach handles very generic models when representing imperfect information on the labels of training data, and decision rules developed within this theory allow us to deal with issues related to the presence of conflicting information or to the absence of close neighbours. We also show that the results of the classical voting KNN procedure and of distance-weighted kNN procedures can be retrieved.

Keywords: Classification, lower prevision, nearest neighbours.
1 Introduction
The k-nearest neighbours (KNN) classification procedure is an old rule [1] that uses the notion of similarity and distance with known instances to classify a new one. Given a vector $x \in \mathbb{R}^D$ of input features, a distance $d : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ and a data set of training samples composed of $N$ couples $(x_i, y_i)$, where $x_i \in \mathbb{R}^D$ are feature values and $y_i \in \mathcal{Y} = \{\omega_1, \ldots, \omega_M\}$ is the class to which the $i$th sample belongs, the voting kNN procedure consists in choosing as the class $y$ of $x$ the one that holds the majority among its $k$ nearest neighbours.

One of the main drawbacks of the original algorithm is that it assumes that the $k$ nearest neighbours are relatively close to the instance to classify, and can act as reliable instances from which to estimate some conditional densities. It also assumes that all classes or patterns are well represented in the input feature space, and that the space is well sampled. In practice, this is rarely true, and the distance between a new instance and its nearest neighbour can be large. This makes the way the basic kNN procedure treats the training samples questionable. Also, some classes of training samples may only be imperfectly known, and this uncertainty should be taken into account.

To integrate these various features, many extensions of the initial method have been proposed: use of weights to account for the distance between neighbours and the instance to classify [2]; use of distance and ambiguity rejection, to cope respectively with nearest neighbours whose distance from the instance to classify is too large and with nearest neighbours giving conflicting information [3]; use of uncertainty representations such as belief functions to cope with uncertainty [4]. For a detailed survey of the kNN algorithm and its different extensions, see [5, Chap. 2].
As far as uncertainty representations are concerned, it can be argued that belief functions do not allow one to model precisely all kinds of uncertainty. For example, they are unable to model exactly the uncertainty given by probability intervals (i.e., lower and upper probabilistic bounds given on each class). Imprecise probability theory and Walley's lower previsions [6] are uncertainty models that encompass belief functions as special cases. In this sense, they are more general and allow for a finer modelling of uncertainty.

In this paper, we propose and discuss a kNN rule based on the use of Walley's lower previsions [6,7], and of the theory underlying them. As for the TBM kNN procedure (based on evidence theory and on Dempster's rule of combination), it allows us to treat all the issues mentioned above without introducing any parameters other than the weights on nearest neighbours; however, it does so with a different approach (being based on a different calculus) and allows the use of more general uncertainty models than the TBM. In particular, we argue that using decision rules proper to the lower prevision approach allows us to take account of ambiguities and distances without having to include additional parameters. Using these imprecise decision rules, we also introduce a criterion for picking the "best" number $k$ of nearest neighbours, balancing imprecision and accuracy. After recalling the material concerning lower previsions (Section 2) needed in this paper, we detail the proposed method and its properties (Section 3), before finishing with some experiments (Section 4).
2 Lower previsions
This section introduces the very basics of lower previsions and the associated tools needed in this paper. We refer to Miranda [7] and Walley [6] for more details.
2.1 Basics of lower previsions
In this paper, we consider that information regarding a variable $X$ assuming its values on a (finite) space $\mathcal{X}$ counting $N$ exclusive and disjoint elements is modelled by means of a so-called coherent lower prevision. We denote by $\mathcal{L}(\mathcal{X})$ the set of real-valued bounded functions on $\mathcal{X}$. A lower prevision $\underline{P} : \mathcal{K} \to \mathbb{R}$ is a real-valued mapping on a subset $\mathcal{K} \subseteq \mathcal{L}(\mathcal{X})$. Given a lower prevision, the dual notion of upper prevision $\overline{P}$ is defined on the set $-\mathcal{K} = \{-f \mid f \in \mathcal{K}\}$ and is such that $\overline{P}(f) = -\underline{P}(-f)$. As discussed by Walley [6], lower previsions can be used to model information about the variable $X$: he interprets $\underline{P}(f)$ as the supremum buying price for the uncertain reward $f$.

Given a set $A \subseteq \mathcal{X}$, its lower probability $\underline{P}(A)$ is the lower prevision of its indicator function $1_{(A)}$, which takes value one on $A$ and zero elsewhere. The upper probability $\overline{P}(A)$ of $A$ is the upper prevision of $1_{(A)}$, and by duality $\overline{P}(A) = 1 - \underline{P}(A^c)$. To a lower prevision $\underline{P}$ can be associated a convex set $\mathcal{P}_{\underline{P}}$ of probabilities, such that
$$\mathcal{P}_{\underline{P}} = \{ p \in \mathbb{P}_{\mathcal{X}} \mid (\forall f \in \mathcal{K})(E_p(f) \geq \underline{P}(f)) \}$$
with $\mathbb{P}_{\mathcal{X}}$ the set of all probability mass functions over $\mathcal{X}$ and $E_p(f) = \sum_{x \in \mathcal{X}} p(x) f(x)$ the expected value of $f$ given $p$. As often done, $\mathcal{P}_{\underline{P}}$ will be called the credal set of $\underline{P}$.

A lower prevision is said to avoid sure loss iff $\mathcal{P}_{\underline{P}} \neq \emptyset$, and to be coherent iff it avoids sure loss and $\forall f \in \mathcal{K}$, $\underline{P}(f) = \min\{E_p(f) \mid p \in \mathcal{P}_{\underline{P}}\}$, i.e. iff $\underline{P}$ is the lower envelope of $\mathcal{P}_{\underline{P}}$. If a lower (upper) prevision is coherent, it corresponds to the lower (upper) expectation of $\mathcal{P}_{\underline{P}}$. If a lower prevision $\underline{P}$ avoids sure loss, its natural extension $\underline{E}(g)$ to a function $g \in \mathcal{L}(\mathcal{X})$ is defined as $\underline{E}(g) = \min\{E_p(g) \mid p \in \mathcal{P}_{\underline{P}}\}$. Note that $\underline{P}$ and its natural extension $\underline{E}$ coincide on $\mathcal{K}$ only when $\underline{P}$ is coherent; otherwise $\underline{P} \leq \underline{E}$ and $\underline{P}(f) < \underline{E}(f)$ for at least one $f$.

Lower previsions are very general uncertainty models, in that they encompass (at least from a static viewpoint) most of the other known uncertainty models. In particular, both necessity measures of possibility theory [8] and belief measures of evidence theory [9] can be seen as particular lower previsions.
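As a small numerical illustration, the lower and upper expectations above can be computed directly when the credal set $\mathcal{P}_{\underline{P}}$ is given by a finite list of extreme points. The sketch below assumes that representation; all function names are ours, not from the paper.

```python
# Lower/upper expectations over a credal set given by its extreme points.
# Mass functions and gambles are dicts over a finite space X.

def expectation(p, f):
    """E_p(f) = sum_x p(x) f(x)."""
    return sum(p[x] * f[x] for x in p)

def lower_expectation(credal, f):
    """Natural extension: E(f) = min over the credal set of E_p(f)."""
    return min(expectation(p, f) for p in credal)

def upper_expectation(credal, f):
    """Dual upper expectation: -E(-f)."""
    return -lower_expectation(credal, {x: -v for x, v in f.items()})

# Two extreme points on X = {a, b}
credal = [{"a": 0.2, "b": 0.8}, {"a": 0.6, "b": 0.4}]
ind_a = {"a": 1.0, "b": 0.0}             # indicator function of {a}
print(lower_expectation(credal, ind_a))  # lower probability of {a}: 0.2
print(upper_expectation(credal, ind_a))  # upper probability of {a}: 0.6
```

The lower probability of an event is recovered here as the lower expectation of its indicator function, matching the definitions above.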
2.2 Vacuous mixture and lower previsions merging
When multiple sources provide possibly unreliable lower previsions modelling their beliefs, we must provide rules both to take this unreliability into account and to merge the different lower previsions into a single one, representing our final beliefs.

An extreme case of coherent lower prevision is the vacuous prevision $\underline{P}_v$ and its natural extension $\underline{E}_v$, which are such that $\underline{E}_v(g) = \inf_{\omega \in \mathcal{X}} g(\omega)$. It represents a state of total ignorance about the real value of $X$. Given a coherent lower prevision $\underline{P}$, its natural extension $\underline{E}$ and a scalar $\epsilon \in [0,1]$, the (coherent) lower prevision $\underline{P}^\epsilon$ that we call a vacuous mixture is such that $\underline{P}^\epsilon = \epsilon \underline{P} + (1-\epsilon) \underline{P}_v$. Its natural extension $\underline{E}^\epsilon$ is such that
$$\underline{E}^\epsilon(f) = \epsilon \underline{E}(f) + (1-\epsilon) \inf_{\omega \in \mathcal{X}} f(\omega),$$
for any $f \in \mathcal{L}(\mathcal{X})$ and with $\underline{E}$ the natural extension of $\underline{P}$. $\epsilon$ can be interpreted as the probability that the information $\underline{P}$ is reliable, $1-\epsilon$ being the probability of being ignorant. The vacuous mixture generalises both the well-known linear-vacuous mixture and the classical discounting rule of belief functions. In terms of credal sets, it is equivalent to computing $\mathcal{P}_{\underline{P}^\epsilon}$ such that
$$\mathcal{P}_{\underline{P}^\epsilon} = \{ \epsilon p_{\underline{P}} + (1-\epsilon) p_v \mid p_{\underline{P}} \in \mathcal{P}_{\underline{P}},\ p_v \in \mathbb{P}_{\mathcal{X}} \}.$$
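The vacuous mixture amounts to a one-line transformation of a lower expectation functional. Below is a minimal sketch (names are ours), using a precise expectation as the input model:

```python
# Vacuous mixture (epsilon-discounting) of a natural extension:
# E_eps(f) = eps * E(f) + (1 - eps) * inf_w f(w).

def vacuous_mixture(E, f, eps):
    """Discount the lower expectation E of gamble f by reliability eps."""
    return eps * E(f) + (1 - eps) * min(f.values())

# A precise model on Y = {w1, w2}: a single mass function
p = {"w1": 0.9, "w2": 0.1}
E = lambda f: sum(p[w] * f[w] for w in p)

ind = {"w1": 1.0, "w2": 0.0}          # indicator of {w1}
print(vacuous_mixture(E, ind, 1.0))   # fully reliable: 0.9
print(vacuous_mixture(E, ind, 0.0))   # total ignorance (vacuous): 0.0
```

With $\epsilon = 0$ the result is the vacuous lower prevision $\inf_\omega f(\omega)$, as stated in the text.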
Now, if we consider $k$ coherent lower previsions $\underline{P}_1, \ldots, \underline{P}_k$ and their natural extensions $\underline{E}_1, \ldots, \underline{E}_k$, then we can average them into a natural extension $\underline{E}_\sigma$ by merging them through an arithmetic mean, that is by considering $\underline{E}_\sigma(f) = \frac{1}{k} \sum_{i=1}^k \underline{E}_i(f)$ for any $f \in \mathcal{L}(\mathcal{X})$. This rule has been justified and used by different authors to merge coherent lower previsions or, equivalently, convex sets of probabilities [10].
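The arithmetic-mean merging rule is straightforward to express over lower expectation functionals; a sketch with hypothetical input sources (names are ours):

```python
# Merging k natural extensions by arithmetic mean:
# E_sigma(f) = (1/k) * sum_{i=1}^k E_i(f).

def merge_mean(extensions, f):
    return sum(E(f) for E in extensions) / len(extensions)

f = {"w1": 1.0, "w2": 0.0}
E1 = lambda g: min(g.values())   # vacuous source (total ignorance)
E2 = lambda g: g["w1"]           # precise source, all mass on w1
print(merge_mean([E1, E2], f))   # (0.0 + 1.0) / 2 = 0.5
```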
2.3 Decision rules
Given some beliefs about a (finite) variable $X$ and a set of preferences, the goal of decision rules here is to select the optimal values $X$ can assume, i.e. the class to which $X$ may belong. Here, we assume that preferences are modelled, for each $\omega \in \mathcal{X}$, by cost functions $f_\omega$, that is, $f_\omega(\omega')$ is the cost of selecting $\omega$ when $\omega'$ is the true class. When uncertainty over $X$ is represented by a single probability $p$, the optimal class is the one whose expected cost is the lowest, i.e. $\hat{\omega} = \arg\min_{\omega \in \mathcal{X}} E_p(f_\omega)$, thus taking minimal risks. If the beliefs about the value of $X$ are given by a lower prevision $\underline{P}$, the classical expected-cost-based decision has to be extended [11].
One way to do so is to still require the decision to be a single class. The most well-known decision rule in this category is the maximin rule, for which the final decision is such that
$$\hat{\omega} = \arg\min_{\omega \in \mathcal{X}} \overline{E}(f_\omega);$$
this amounts to minimising the upper expected cost, i.e., the worst possible consequence, and corresponds to a cautious decision. Other possible rules include minimising the lower expected cost or minimising a value in between.

The other way to extend expected cost is to give as decision a set (possibly, but not necessarily, reduced to a singleton) of classes, reflecting our indecision and the imprecision of our beliefs. This requires building, among the possible choices (here, the classes), a partial ordering, and then selecting only the choices that are not dominated by another one. Two such extensions are the interval ordering $\leq_I$ and the maximality ordering $\leq_M$.

Using interval ordering, a choice $\omega$ is dominated by a choice $\omega'$, denoted by $\omega \leq_I \omega'$, iff $\overline{E}(f_{\omega'}) \leq \underline{E}(f_\omega)$, that is, if the upper expected cost of picking $\omega'$ is sure to be lower than the lower expected cost of picking $\omega$. The decision set $\Omega_I$ is then
$$\Omega_I = \{ \omega \in \mathcal{X} \mid \nexists \omega' \text{ s.t. } \omega \leq_I \omega' \}.$$

Using maximality ordering, a choice $\omega$ is dominated by a choice $\omega'$, denoted by $\omega \leq_M \omega'$, iff $\underline{E}(f_\omega - f_{\omega'}) > 0$. This has the following interpretation: given our beliefs, exchanging $\omega'$ for $\omega$ would have a strictly positive expected cost, hence we are not ready to do so. The decision set $\Omega_M$ is then
$$\Omega_M = \{ \omega \in \mathcal{X} \mid \nexists \omega' \text{ s.t. } \omega \leq_M \omega' \}.$$

The maximality ordering refines the interval ordering and is stronger, in the sense that we always have $\Omega_M \subseteq \Omega_I$. Using these decision rules, the more precise and non-conflicting our information is, the smaller the set of possible classes $\Omega$.
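Both decision sets can be computed by pairwise comparisons of lower and upper expected costs. The sketch below assumes the credal set is given by a finite list of extreme points and uses a strict inequality for interval dominance; all names are ours, not from the paper.

```python
# Interval-dominance and maximality decision sets from cost functions.
# credal: list of mass functions; costs: dict class -> cost gamble f_w.

def low(credal, f):
    return min(sum(p[x] * f[x] for x in p) for p in credal)

def upp(credal, f):
    return max(sum(p[x] * f[x] for x in p) for p in credal)

def omega_I(credal, costs):
    """Omega_I: classes with no w2 such that upp(f_w2) < low(f_w)."""
    return [w for w in costs
            if not any(upp(credal, costs[w2]) < low(credal, costs[w])
                       for w2 in costs if w2 != w)]

def omega_M(credal, costs):
    """Omega_M: classes with no w2 such that low(f_w - f_w2) > 0."""
    return [w for w in costs
            if not any(low(credal, {x: costs[w][x] - costs[w2][x]
                                    for x in costs[w]}) > 0
                       for w2 in costs if w2 != w)]

# 0-1 costs on X = {a, b}; a credal set leaning towards class a
costs = {"a": {"a": 0.0, "b": 1.0}, "b": {"a": 1.0, "b": 0.0}}
credal = [{"a": 0.8, "b": 0.2}, {"a": 0.6, "b": 0.4}]
print(omega_M(credal, costs), omega_I(credal, costs))  # ['a'] ['a']
```

On this example one can check that $\Omega_M \subseteq \Omega_I$; with a more conflicting credal set, both sets would contain the two classes.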
3 The method
Let $x_1, \ldots, x_N$ be $N$ $D$-dimensional training samples, $\mathcal{Y} = \{\omega_1, \ldots, \omega_M\}$ the set of possible classes, and $\underline{P}_i : \mathcal{L}(\mathcal{Y}) \to [0,1]$ the lower prevision modelling our knowledge about the class to which the sample $x_i$ belongs. Given a new instance $x$ to classify, that is, to which we have to assign a class $y \in \mathcal{Y}$, we denote by $x_{(1)}, \ldots, x_{(k)}$ its $k$ ordered nearest neighbours (i.e. $d_{(i)} \leq d_{(j)}$ if $i \leq j$). For a given nearest neighbour $x_{(i)}$, the knowledge $\underline{P}_{(i)}$ can be regarded as a piece of evidence related to the unknown class of $x$. However, this piece of knowledge is not 100% reliable, and should be discounted by a value $\epsilon_{(i)} \in [0,1]$ depending on its class, such that, for any $f \in \mathcal{L}(\mathcal{Y})$,
$$\underline{E}_{(i),x}(f) = \epsilon_{(i)} \underline{E}_{(i)}(f) + (1 - \epsilon_{(i)}) \inf_{\omega \in \mathcal{Y}} f(\omega).$$
It seems natural to ask for $\epsilon_{(i)}$ to be a decreasing function of $d_{(i)}$, since the further away the neighbour, the less reliable the information it provides about the unknown class. Similarly to Denoeux's proposal, we can consider the general formula
$$\epsilon_{(i)} = \epsilon_0 \, \phi(d_{(i)}),$$
where $\phi$ is a non-increasing function that may depend on the class given by $x_{(i)}$. In addition, the following conditions should hold:
$$0 < \epsilon_0 < 1; \qquad \phi(0) = 1 \quad \text{and} \quad \lim_{d \to \infty} \phi(d) = 0.$$
The first condition implies that even if the new instance has the same input as one training data sample, we do not consider it to be 100% reliable, as the relation linking the input feature space and the output classes is not necessarily a function. From $\underline{P}_{(1),x}, \ldots, \underline{P}_{(k),x}$, we then obtain a combined lower prevision $\underline{P}_x$ such that
$$\underline{P}_x = \frac{1}{k} \sum_{i=1}^k \underline{P}_{(i),x}.$$

Using $\underline{P}_x$ as the final uncertainty model for the true class of $x$, one can predict its final class, either as a single class by using a maximin-like criterion, or as a set of possible classes by using maximality or interval dominance. Using maximality or interval dominance is a good way to treat both ambiguity and large distances to the nearest neighbours. Indeed, if all nearest neighbours agree on the output class and are close to the new instance, the obtained lower prevision $\underline{P}_x$ will be precise enough so that the criteria will end up pointing to only one possible class (i.e., $\Omega_M$, $\Omega_I$ will be singletons). On the contrary, if nearest neighbours disagree or are far from the new instance, $\underline{P}_x$ will be imprecise or indecisive, and $\Omega_M$, $\Omega_I$ will contain several possible classes.
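Putting the pieces together, the following is a toy sketch of the rule described above, under simplifying assumptions that are ours: each training label is precisely known (so each $\underline{P}_i$ is a point mass), $\phi(d) = \exp(-\gamma d)$, and decisions use maximality with 0-1 costs. Function and parameter names are illustrative.

```python
import math

def lp_knn(train, x, k, eps0=0.95, gamma=1.0):
    """Combined lower expectation E_x for instance x: the arithmetic mean
    of the k discounted (vacuous-mixed) neighbour models."""
    neigh = sorted(train, key=lambda s: math.dist(s[0], x))[:k]

    def E_x(f):
        total = 0.0
        for xi, yi in neigh:
            eps = eps0 * math.exp(-gamma * math.dist(xi, x))
            # precise label: E_(i)(f) = f(yi), then vacuous mixture
            total += eps * f[yi] + (1 - eps) * min(f.values())
        return total / len(neigh)
    return E_x

def maximality_set(E_x, classes):
    """Omega_M under 0-1 costs f_w(w') = 1 if w' != w, else 0."""
    costs = {w: {w2: float(w2 != w) for w2 in classes} for w in classes}
    return [w for w in classes
            if not any(E_x({w3: costs[w][w3] - costs[w2][w3]
                            for w3 in classes}) > 0
                       for w2 in classes if w2 != w)]

# Toy data: two well-separated classes
train = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"),
         ((1.0, 1.0), "B"), ((1.1, 1.0), "B")]
E_x = lp_knn(train, (0.05, 0.0), k=3)
print(maximality_set(E_x, ["A", "B"]))  # ['A']: neighbours agree, are close
```

For an instance lying halfway between the two clusters, the decision set would typically contain both classes, reflecting the indecision discussed above.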
3.1 Using lower previsions to choose k
A problem when using the k-nearest neighbour procedure is to choose the "best" number $k$ of neighbours to consider. This number is often selected as the one achieving the best performance in a cross-validation procedure, but kNN rules can display erratic performance if $k$ is slightly increased or decreased, even if only by one.

We propose here a new approach to guide the choice of $k$, using the features of lower previsions: we propose to choose the value $k$ achieving the best compromise between imprecision and accuracy, estimated respectively from the number of optimal classes selected for each test sample, and from the percentage of times the true class is inside the set of possible ones.

Let $(x_{N+1}, y_{N+1}), \ldots, (x_{N+T}, y_{N+T})$ be the test samples. Given a value $k$ of nearest neighbours, let $\Omega^k_{M,i}$ denote the set of classes retrieved by the maximality criterion for $x_{N+i}$, and $\delta^k_i : 2^{\mathcal{Y}} \to \{0,1\}$ the function such that $\delta^k_i = 1$ if $y_{N+i} \in \Omega^k_{M,i}$ and $0$ otherwise. That is, $\delta^k_i$ is one if the right answer is in the set of possible classes. Then, we propose to estimate the informativeness $Inf_k$ and the accuracy $Acc_k$ of our kNN method as:
$$Inf_k = 1 - \frac{\sum_{i=1}^T |\Omega^k_{M,i}| - T}{T(M-1)}; \qquad Acc_k = \frac{\sum_{i=1}^T \delta^k_i}{T}.$$
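These two measures can be computed directly from the maximality decision sets obtained on the test data; a sketch (names are ours):

```python
# Informativeness and accuracy of set-valued predictions over T test
# samples with M classes, following the two formulas above.

def informativeness(omega_sets, M):
    T = len(omega_sets)
    return 1 - (sum(len(O) for O in omega_sets) - T) / (T * (M - 1))

def accuracy(omega_sets, true_labels):
    return sum(y in O for O, y in zip(omega_sets, true_labels)) / len(omega_sets)

# T = 3 test samples, M = 3 classes
omegas = [{"w1"}, {"w1", "w2"}, {"w3"}]
truth = ["w1", "w2", "w1"]
print(informativeness(omegas, 3))  # 1 - (4 - 3)/(3 * 2) = 0.8333...
print(accuracy(omegas, truth))     # 2/3
```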
Note that informativeness has value one iff $|\Omega^k_{M,i}| = 1$ for $i = 1, \ldots, T$, that is, decisions are precise, while accuracy measures the number of times the right class is in the