10 pages

A k-nearest neighbours method based on lower previsions

of 10
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
A k-nearest neighbours method based on lower previsions
  A K-nearest neighbours method based on lowerprevisions Sebastien Destercke INRA/CIRAD, UMR1208, 2 place P. Viala, F-34060 Montpellier cedex 1, France Abstract. K-nearest neighbours algorithms are among the most popular existingclassification methods, due to their simplicity and good performances. Over theyears, several extensions of the initial method have been proposed. In this paper,we propose a K-nearest neighbours approach that uses the theory of impreciseprobabilities, and more specifically lower previsions. This approach handles verygeneric models when representing imperfect information on the labels of train-ing data, and decision rules developed within this theory allows to deal with is-sues related to the presence of conflicting information or to the absence of closeneighbours. We also show that results of the classical voting K-NN proceduresand distance-weighted k -NN procedures can be retrieved. Keywords : Classification, lower prevision, nearest neighbours. 1 Introduction The k-nearest neighbours (K-NN) classification procedure is an old rule[1] that usesthe notion of similarity and distance with known instances to classify a new one. Givena vector x ∈ R D of input features, a distance d : R D × R D → R and a data set of training samples composed of  N  couples ( x i ,y i ) where x i ∈ R D are feature valuesand y i ∈ Y  = { ω 1 ,...,ω M  } is the class to which belongs the i th sample, the voting k -NN procedure consists in choosing as the class y of  x the one that is in majority inthe k nearest neighbours.One of the main drawback of the srcinal algorithm is that it assumes that the k -nearest neighbors are relatively close to the instance to classify, and can act as reliableinstances to estimate some conditional densities. It also assumes that all classes or pat-terns are well represented in the input feature space, and that the space is well sampled.In practice, this is rarely true, and the distance between a new instance and its nearestneighbour can be large. This makes the way basic k-NN procedure treats the trainingsamples questionable Also, some classes of training samples may only be imperfectlyknown, and this uncertainty should be taken into account.To integrate these various features, many extensions of the initial method have beenproposed: use of weights to account for distance between neighbours and instance toclassification[2]; use of distance and ambiguity rejection, to cope respectively withnearest neighbours whose distance from the instance to classify is too large and withnearestneighboursgivingconflictinginformation[3];useofuncertaintyrepresentationssuch as belief functions to cope with uncertainty [4]. For a detailed survey of the k-NNalgorithm and its different extensions, see[5,Chap. 2].  As far as uncertainty representations are concerned, it can be argued that belief functions do not allow to model precisely all kinds of uncertainties. For example, theyare unable to model exactly uncertainty given by probability intervals (i.e., lower andupper probabilistic bounds given on each class). Imprecise probability theory and wal-ley’s lower previsions[6] are uncertainty models that encompass belief functions asspecial cases. In this sense, they are more general and allow for a finer modelling of uncertainty.In this paper, we propose and discuss a k-NN rule based on the use of Walley’slower prevision [6,7], and of the theory underlying them. As for the TBM k-NN pro-cedure (based on evidence theory and on Dempster’s rule of combintion), it allows totreat all the issues mentioned above without introducing any other parameters than theweights on nearest neighbours, however it does so with a different approach (beingbased on different calculus) and allows the use of more general uncertainty models thanthe TBM. In particular, we argue that using decision rules proper to the lower previsionsapproach allows to take account of ambiguities and distances without having to includeadditional parameters. Using these imprecise decision rules, we also introduce a criteriaallowing to pick the "best" number k of nearest neighbours, balancing imprecision andaccuracy. After recalling the material concerning lower previsions (Section2) neededin this paper, we details the proposed method and its properties (Section3), beforefinishing with some experiments (Section4). 2 Lower previsions This section introduces the very basics about lower previsions and associated toolsneeded in this paper. We refer to Miranda[7] and Walley [6]for more details. 2.1 Basics of lower previsions In this paper, we consider that information regarding a variable X  assuming its valueson a (finite) space X  counting N  exclusive and disjoint elements is modelled by themeans of a so-called coherent lower previsions. We denote by L ( X  ) the set of real-valuedboundedfunctionson X  .Alowerprevision P  : K → R isareal-valuedmappingon a subset K ⊆ L ( X  ) . Given a lower prevision, the dual notion of upper prevision P  isdefined on the set −K = {− f  | f  ∈ K} and is such that P  ( f  ) = − P  ( − f  ) . As discussedby Walley [6], lower previsions can be used to model information about the variable X  .He interprets P  ( f  ) as the supremum buying price for the uncertain reward f  .Given a set A ⊆ X  , its lower probability P  ( A ) is the lower prevision of its indicatorfunction 1 ( A ) , that takes value one on A and zero elsewhere. The upper probability P  ( A ) of  A is the upper prevision of  1 ( A ) , and by duality P  ( A ) = 1 − P  ( A c ) . To alower prevision P  can be associated a convex set P  P  of probabilities, such that P  P  = {  p ∈ P X  | ( ∀ f  ∈ K )( E   p ( f  ) ≥ P  ( f  )) } with P X  thesetofallprobabilitymassfunctionsover P X  and E   p ( f  ) =  x ∈X  p ( x ) f  ( x ) the expected value of  f  given p . As often done, P  P  will be called the credal set of  P  .  A lower prevision is said to avoid sure loss iff  P  P   = ∅ and to be coherent iff itavoids sure loss and ∀ f  ∈ K , P  ( f  ) = min { E   p ( f  ) |  p ∈ P  P  } , i.e. iff  P  is the lowerenvelope of  P  P  . If a lower (upper) prevision is coherent, it corresponds to the lower(upper) expectation of  P  P  . If a lower prevision P  avoids sure loss, its natural extension E  ( g ) to a function g ∈ L ( X  ) is defined as E  ( g ) = min { E   p ( g ) |  p ∈ P  P  } . Note that P  and its natural extension E  coincide on K only when P  is coherent, otherwise P  ≤ E  and P  ( f  ) < E  ( f  ) for at least one f  .Lower previsions are very general uncertainty models, in that they encompass (atleast from a static viewpoint) most of the other known uncertainty models. In partic-ular both necessity measures of possibility theory [8]and belief measures of evidencetheory [9]can be seen as particular lower previsions. 2.2 Vacuous mixture and lower previsions merging When multiple sources provide possibly unreliable lower previsions modelling theirbeliefs, we must provide rules both to take this unreliability into account and to mergethe different lower previsions into a single one, representing our final beliefs.An extreme case of coherent lower prevision is the vacuous prevision P  v and itsnatural extension E  v , which are such that E  v ( g ) = inf  ω ∈X  g ( ω ) . It represents a stateof total ignorance about the real value of  X  . Given a coherent lower prevision P  , itsnatural extension E  and a scalar  ∈ [0 , 1] , the (coherent) lower prevision P   that wecall vacuous mixture is such that P   = P  + (1 −  ) P  v . Its natural extension E   issuch that E   ( f  ) = E  ( f  ) + (1 −  )inf  ω ∈X  f  ( ω ) , for any f  ∈ L ( X  ) and with E  the natural extension of  P  .  can be interpreted as the probability that the information P  is reliable, 1 −  being the probability of being ignorant. The vacuous mixture is ageneralise both the the well-known linear-vacuous mixture and the classical discountingrule of belief functions. In terms of credal sets, it is equivalent to compute P  P   such that P  P   = { p P  + (1 −  )  p v |  p P  ∈ P  P  ,p v ∈ P X  } . Now, if we consider k coherent lower previsions P  1 ,...,P  k and their natural ex-tensions E  1 ,...,E  k , then we can average them into a natural extension E  σ by mergingthem through an arithmetic mean, that is by considering E  σ ( f  ) = 1 k  ki =1 E  i ( f  ) forany f  ∈ L ( X  ) . This rule has been justified and used by different authors to mergecoherent lower previsions or, equivalently, convex sets of probabilities [10]. 2.3 Decision rules Given some beliefs about a (finite) variable X  and a set of preferences, the goal of decision rules is here to select the optimal values X  can assume, i.e. the class to which X  may belong. Here, we assume that preferences are modeled, for each ω ∈ X  , by costfunctions f   ω , that is f   ω ( ω  ) is the cost of selecting ω  when ω is the true class. Whenuncertainty over X  is represented by a single probability p , the optimal class is the onewhose expected cost is the lowest, i.e.  ω = argmin ω ∈X  E   p ( f   ω ) , thus taking minimalrisks. If the beliefs about the value of  X  are given by a lower prevision P  , the classicalexpected cost based decision has to be extended [11].  One way to do so is to still require the decision to be a single class. The most well-known decision rule in this category is the maximin rule, for which the final decision issuch that   ω = arg min ω ∈X  E   p ( f   ω ) this amounts to minimising the upper expected cost, i.e., the worst possible conse-quence, and corresponds to a cautious decision. Other possible rules include minimisingthe lower expected cost or minimising a value in-between.The other way to extend expected cost is to give as decision a set (possibly, butnot necessarily reduced to a singleton) of classes, reflecting our indecision and the im-precision of our beliefs. This requires to build, among the possible choices (here, theclasses), a partial ordering, and then to select only the choices that are not dominatedby another one. Two such extensions are the interval ordering ≤ I  and the maximalityordering ≤ M  . Using interval ordering, a choice ω is dominated by a choice ω  , denotedby ω ≤ I  ω  , iff  E  ( f   ω ) ≤ E  ( f  ω ) , that is if the upper expected cost of picking ω  is sureto be lower than the lower expected cost of picking ω . The decision set  Ω I  is then   Ω I  = { ω ∈ X| ∃ ω  s.t. ω ≤ I  ω  } . Using maximality ordering, a choice ω is dominated by a choice ω  , denoted by ω ≤ M  ω  , iff  E  ( f  ω − f  ω  ) > 0 . This has the following interpretation: given our beliefs, ex-changing ω for ω  would have a strictly positive expected cost, hence we are not readyto do so. The decision set  Ω M  is then   Ω M  = { ω ∈ X| ∃ ω  s.t. ω ≤ M  ω  } . The maximility ordering refines the Interval ordering and is stronger, in the sense thatwe always have  Ω M  ⊆  Ω I  . Using these decision rules, the more precise and non-conflicting our information is, the smaller is the set of possible classes  Ω . 3 The method Let x 1 ,..., x N  be N D-dimensional training samples, Y  = { ω 1 ,...,ω M  } the set of possible classes, and P  i : L ( Y  ) → [0 , 1] be the lower prevision modelling our knowl-edge about the class to which the sample x i belongs. Given a new instance x to classify,that is to which we have to assign a class y ∈ Y  , we denote by x (1) ,..., x ( k ) its k or-dered nearest neighbours (i.e. d ( i ) < d ( j ) if  i ≤ j ). For a given nearest neighbour x ( i ) ,the knowledge P  ( i ) can be regarded as a piece of evidence related to the unknown classof  x . However, this piece of knowledge is not 100% reliable, and should be discountedby a value  i ∈ [0 , 1] depending of its class, such that, for any f  ∈ L ( Y  ) , E  ( i ) , x ( f  ) =  ( i ) E  ( i ) + (1 −  ( i ) ) inf  ω ∈Y  f  ( ω ) . It seems natural to ask for  be a decreasing function of  d ( i ) , since the further away isthe neighbour, the less reliable is the information it provides about the unknown class.Similarly to Denoeux proposal, we can consider the general formula  =  0 φ ( d ( i ) ) ,  where φ is a non-increasing function that can be depended of the class given by x ( i ) . Inaddition, the following conditions should hold: 0 <  0 < 1 ; φ (0) = 1 and lim d →∞ φ ( d ) = 0 . The first condition imply that even if the new instance has the same input as onetraining data sample, we do not consider it to be 100% reliable, as the relation link-ing the input feature space and the output classes is not necessarily a function. From P  (1) , x ,...,P  ( k ) , x , we then obtain a combined lower prevision P  such that P  x =1 k k  i =1 P  ( i ) , x . Using P  x as the final uncertainty model for the true class of  x , one can predict itsfinal class, either as a single class by using a maximin-like criteria or as a set of pos-sible classes by using maximality or interval dominance. Using maximality or intervaldominance is a good way to treat both ambiguity or large distances with the nearestneighbours. Indeed, if all nearest neighbours agree on the output class and are close tothe new instance, the obtained lower prevision P  x will be precise enough so that thecriteria will end up pointing only one possible class (i.e.,  Ω M  ,  Ω I  will be singletons).On the contrary, if nearest neighbours disagree or are far from the new instance, P  x will be imprecise or indecisive, and  Ω M  ,  Ω I  will contain several possible classes. 3.1 Using lower previsions to choose k A problem when using the k-nearest neighbour procedure is to choose the "best" num-ber k of neighbours to consider. This number is often selected as the one achieving thebest performance in a cross-validation procedure, but k-NN rules can display erraticperformances if  k is slightly increased or decreased, even if it is by one.Weproposehereanewapproachtoguidethechoiceof  k ,usingthefeaturesoflowerprevisions: we propose to choose the value k achieving the best compromise betweenimprecision and precision, estimated respectively from the number of optimal classesselected for each test sample, and from the percentage of times where the true class isinside the set of possible ones.Let ( x N  +1 ,y N  +1 ) ,..., ( x N  + T  ,y N  + T  ) be the test samples. Given a value k of nearest neighbours, let Ω kM,i denote the set of classes retrieved by maximality crite-ria for x N  + i , and δ ki : 2 |Y| → { 0 , 1 } the function such that δ ki = 1 if  y N  + i ∈ Ω kM,i and0 otherwise. That is, δ ki is one if the right answer is in the set of possible classes. Then,we propose to estimate the informativeness Inf  k and the accuracy Acc k of our k -NNmethod as: Inf  k = 1 −  T i =1 | Ω kM,i |− T T  ( M  − 1); Acc k =  T i =1 δ ki T  Note that informativeness has value one iff  | Ω kM,i | = 1 for i = 1 ,...,T  , that is deci-sions are precise, while accuracy measures the number of times the right class is in the
Related Documents
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!