Combining Classifiers for Word Sense Disambiguation Based on Dempster-Shafer Theory and OWA Operators

Cuong Anh Le, Van-Nam Huynh, Akira Shimazu, Yoshiteru Nakamori
Japan Advanced Institute of Science and Technology
1-1 Asahidai, Tatsunokuchi, Ishikawa 923-1292, Japan

Abstract

In this paper, we discuss a framework for weighted combination of classifiers for word sense disambiguation (WSD). This framework is essentially based on the Dempster-Shafer theory of evidence (Dempster, 1967; Shafer, 1976) and ordered weighted averaging (OWA) operators (Yager, 1988). We first determine various kinds of features which could provide complementary linguistic information about the context, and then combine these sources of information based on Dempster's rule of combination and OWA operators to identify the meaning of a polysemous word. We experimentally design a set of individual classifiers, each of which corresponds to a distinct representation type of context considered in the WSD literature, and then the discussed combination strategies are tested and compared on the English lexical samples of Senseval-2 and Senseval-3.

Key words: Computational linguistics, Classifier combination, Word sense disambiguation, OWA operator, Evidential reasoning.

1 Introduction

The issue of automatic disambiguation of word senses has been of interest and concern since the 1950s, and is still one of the most important open problems in natural language processing (NLP) [26]. Roughly speaking, word sense disambiguation involves the association of a given word in a text or discourse with a particular sense among numerous potential senses of that word. As mentioned in [14], this is an "intermediate task" necessary to accomplish most NLP tasks, such as grammatical analysis and lexicography in linguistic studies, or machine translation, man-machine communication, and message understanding in language understanding applications.
Besides these applications, WSD may also have potential uses in other applications involving data and knowledge engineering, such as information retrieval, information extraction, and text mining [1]. More particularly, in information retrieval (IR), ambiguity has to be resolved in some queries; for example, given the query "depression", should the system return documents about illness, weather systems, or economics? Though current IR systems do not use WSD explicitly and rely on the user typing enough context in the query to retrieve only documents relevant to the intended sense (e.g. "tropical depression"), early experiments suggested that reliable IR would require at least 90% disambiguation accuracy for explicit WSD to be of benefit [30]. In addition, WSD has more recently been shown to improve cross-lingual IR and document classification [3,5,33]. On the other hand, in information extraction and text mining, WSD is required for the accurate analysis of text in many applications. For instance, an intelligence gathering system might require the flagging of all references to illegal drugs, rather than medical drugs. More generally, the Semantic Web requires automatic annotation of documents according to a reference ontology: all textual references must be resolved to the right concepts and event structures in the ontology (see [6]). Named-entity classification, co-reference determination, and acronym expansion can also be cast as WSD problems for proper names. WSD is only beginning to be applied in these areas.

Since its inception, many approaches have been proposed for WSD in the literature (see [14] for a survey). During the last two decades, many supervised machine learning algorithms have been used for this task, including the Naive Bayesian (NB) model, decision trees, exemplar-based models, support vector machines, maximum entropy models, etc.
On the other hand, as observed in studies of pattern recognition systems, although one could choose one of the available learning systems based on an experimental assessment of them in the hope of achieving the best performance on the pattern recognition problem at hand, the sets of patterns misclassified by the different systems would not necessarily overlap [17]. This means that different classifiers may potentially offer complementary information about the patterns to be classified. In other words, features and classifiers of different types complement one another in classification performance. This observation has strongly motivated the interest in combining classifiers in recent years, with particular application to WSD as in [9,10,13,15,18,28,29,34].

As is well known, there are basically two classifier combination scenarios. In the first scenario, all classifiers use the same representation of the input pattern, while in the second scenario, each classifier uses its own representation of the input pattern. An important application of combining classifiers in the second scenario is the possibility of integrating physically different types of features. In addition, an important issue in combining classifiers is which combination strategy should be used to derive a consensus decision. In [17], the authors proposed a common theoretical framework for combining classifiers which leads to many decision rules commonly used in practice. This framework has also been applied to the problem of WSD in [20,22]. In this paper, we focus on the combination of classifiers according to the second scenario, with the discussion placed in the context of WSD.
Particularly, we discuss a framework for weighted combination of classifiers in which each individual classifier uses a distinct representation of the objects to be classified. This framework is based on the Dempster-Shafer (DS) theory of evidence [31] and OWA operators [35].

In [2], Al-Ani and Deriche proposed a new technique for combining classifiers using DS theory, in which different classifiers correspond to different feature sets. In their approach, the distance between the output classification vector provided by each single classifier and a reference vector is used to estimate basic probability assignments (BPAs). These BPAs are then combined by means of Dempster's rule of combination to obtain a new output vector that represents the combined confidence in each class label. Differently from their approach, we directly use the output classification vectors of individual classifiers to define the corresponding BPAs, making use of the discounting operation in DS theory, and then combine the resulting BPAs to obtain the final BPA for making the classification decision. More particularly, we first consider various ways of using context in WSD as distinct representations of the polysemous word under consideration, and then all these representations are used jointly to identify the meaning of the target word. On the one hand, the various ways of using the context could be considered as providing different information sources for identifying the meaning of the target word. Moreover, each of these information sources does not by itself provide 100% certainty as a whole piece of evidence for identifying the sense of the target.
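As a minimal illustration of this idea (a sketch, not the authors' actual implementation), the following Python fragment treats each classifier's output probability vector as a BPA over singleton senses, discounts it by the classifier's reliability so that the remaining mass goes to the whole frame, and combines the resulting BPAs with Dempster's rule. All sense labels, probabilities, and reliability values here are invented for the example.

```python
from itertools import product

def discounted_bpa(probs, alpha, frame):
    """Turn a classifier's output distribution into a discounted BPA.

    Mass alpha * p(s) goes to each singleton {s}; the remaining
    (1 - alpha) is assigned to the whole frame Theta.
    """
    theta = frozenset(frame)
    m = {frozenset([s]): alpha * p for s, p in probs.items() if p > 0}
    m[theta] = m.get(theta, 0.0) + (1.0 - alpha)
    return m

def dempster_combine(m1, m2):
    """Dempster's rule: intersect focal elements, renormalize by 1 - conflict."""
    joint, conflict = {}, 0.0
    for (A, a), (B, b) in product(m1.items(), m2.items()):
        inter = A & B
        if inter:
            joint[inter] = joint.get(inter, 0.0) + a * b
        else:
            conflict += a * b  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence (kappa = 1)")
    return {A: v / (1.0 - conflict) for A, v in joint.items()}

# Two hypothetical classifiers disambiguating the query "depression":
frame = {"illness", "weather", "economics"}
m1 = discounted_bpa({"illness": 0.7, "weather": 0.3}, alpha=0.9, frame=frame)
m2 = discounted_bpa({"illness": 0.6, "economics": 0.4}, alpha=0.8, frame=frame)
combined = dempster_combine(m1, m2)
best = max((A for A in combined if len(A) == 1), key=lambda A: combined[A])
print(best)  # frozenset({'illness'})
```

Note that because each discounted BPA has only singletons and the full frame as focal elements, every pairwise intersection is again a singleton, the frame, or empty, which keeps the combination cheap.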
Then, by considering the problem as one of weighted combination of evidence for decision making, we formulate a general rule of classifier combination based on the DS theory of evidence [31], adopting a probabilistic interpretation of the weights. This interpretation of weights seems appropriate when defining weights in terms of the accuracy of individual classifiers.

On the other hand, by considering each representation of the context as information inspired by a semantic or syntactic criterion for the purpose of word sense identification, we can apply OWA operators for aggregating multiple criteria to form an overall decision function, considered as a fuzzy-majority-based voting strategy. It is worth mentioning that the use of OWA operators in classifier combination has been studied before, for example in [19]. In this paper, however, we use OWA operators for classifier fusion in their semantic relation to linguistic quantifiers, so that we can provide a framework for combining classifiers which also yields several commonly used decision rules, but without some of the strong assumptions made in the work by Kittler et al. [17].

Experimentally, we design a set of individual classifiers, each of which corresponds to a distinct representation type of context considered in the WSD literature, and then the proposed combination strategies are experimentally tested on the English lexical samples of Senseval-2 and Senseval-3.

The rest of this paper is organized as follows. In Section 2, we recall basic notions from the Dempster-Shafer theory of evidence and OWA operators. Section 3 is devoted to the theoretical framework for combining classifiers in WSD based on these theories. An experimental study is then conducted in Section 4. Finally, Section 5 presents some concluding remarks.

2 Preliminaries

In this section we briefly review basic notions of the DS theory of evidence and OWA operators.
2.1 Dempster-Shafer Theory of Evidence

In DS theory, a problem domain is represented by a finite set $\Theta$ of mutually exclusive and exhaustive hypotheses, called a frame of discernment [31]. In the standard probability framework, all elements of $\Theta$ are assigned a probability, and when the degree of support for an event is known, the remainder of the support is automatically assigned to the negation of the event. In DS theory, on the other hand, mass is assigned only to those events the evidence supports, and committing support to an event does not necessarily imply that the remaining support is committed to its negation. Formally, a basic probability assignment (BPA, for short) is a function $m : 2^{\Theta} \to [0,1]$ verifying

$$m(\emptyset) = 0, \quad \text{and} \quad \sum_{A \in 2^{\Theta}} m(A) = 1$$

The quantity $m(A)$ can be interpreted as a measure of the belief that is committed exactly to $A$, given the available evidence. A subset $A \in 2^{\Theta}$ with $m(A) > 0$ is called a focal element of $m$. A BPA $m$ is said to be vacuous if $m(\Theta) = 1$ and $m(A) = 0$ for all $A \neq \Theta$.

Two evidential functions derived from the basic probability assignment $m$ are the belief function $Bel_m$ and the plausibility function $Pl_m$, defined as

$$Bel_m(A) = \sum_{\emptyset \neq B \subseteq A} m(B), \quad \text{and} \quad Pl_m(A) = \sum_{B \cap A \neq \emptyset} m(B)$$

The difference between $m(A)$ and $Bel_m(A)$ is that while $m(A)$ is our belief committed to the subset $A$ excluding any of its proper subsets, $Bel_m(A)$ is our degree of belief in $A$ as well as in all of its subsets. Consequently, $Pl_m(A)$ represents the degree to which the evidence fails to refute $A$. Note that the three functions are in one-to-one correspondence with each other.

Two useful operations that play a central role in the manipulation of belief functions are discounting and Dempster's rule of combination [31]. The discounting operation is used when a source of information provides a BPA $m$, but one knows that this source has probability $\alpha$ of being reliable.
Then one may adopt $1 - \alpha$ as one's discount rate, which results in a new BPA $m^{\alpha}$ defined by

$$m^{\alpha}(A) = \alpha\, m(A), \quad \text{for any } A \subset \Theta \qquad (1)$$
$$m^{\alpha}(\Theta) = (1 - \alpha) + \alpha\, m(\Theta) \qquad (2)$$

Consider now two pieces of evidence on the same frame $\Theta$ represented by two BPAs $m_1$ and $m_2$. Dempster's rule of combination is then used to generate a new BPA, denoted by $m_1 \oplus m_2$ (also called the orthogonal sum of $m_1$ and $m_2$), defined as follows:

$$(m_1 \oplus m_2)(\emptyset) = 0, \qquad (m_1 \oplus m_2)(A) = \frac{1}{1-\kappa} \sum_{B \cap C = A} m_1(B)\, m_2(C) \qquad (3)$$

where

$$\kappa = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C) \qquad (4)$$

Note that the orthogonal sum combination is only applicable to two BPAs that verify the condition $\kappa < 1$.

2.2 OWA Operators

The notion of OWA operators was first introduced in [35] in connection with the problem of aggregating multiple criteria to form an overall decision function. A mapping

$$F : [0,1]^n \to [0,1]$$

is called an OWA operator of dimension $n$ if it is associated with a weighting vector $W = [w_1, \ldots, w_n]$ such that (1) $w_i \in [0,1]$ and (2) $\sum_i w_i = 1$, and

$$F(a_1, \ldots, a_n) = \sum_{i=1}^{n} w_i b_i$$

where $b_i$ is the $i$-th largest element in the collection $a_1, \ldots, a_n$.
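The OWA definition above is straightforward to implement: sort the arguments in descending order, then take the weighted sum. The sketch below (all inputs and weights are illustrative, not taken from the paper's experiments) also shows the standard quantifier-guided construction $w_i = Q(i/n) - Q((i-1)/n)$ for a regular increasing monotone quantifier $Q$, which is how OWA weights are commonly tied to linguistic quantifiers such as "most".

```python
def owa(values, weights):
    """OWA aggregation: reorder inputs in descending order, then weight."""
    assert abs(sum(weights) - 1.0) < 1e-9
    b = sorted(values, reverse=True)  # b[i] is the (i+1)-th largest input
    return sum(w * x for w, x in zip(weights, b))

def quantifier_weights(n, q):
    """Weights from a regular increasing monotone quantifier Q on [0, 1]:
    w_i = Q(i/n) - Q((i-1)/n)."""
    return [q(i / n) - q((i - 1) / n) for i in range(1, n + 1)]

scores = [0.9, 0.2, 0.6, 0.4]  # one score per information source (illustrative)

# Special cases of the weighting vector recover familiar decision rules:
print(owa(scores, [1, 0, 0, 0]))  # max rule -> 0.9
print(owa(scores, [0, 0, 0, 1]))  # min rule -> 0.2
print(owa(scores, [0.25] * 4))    # arithmetic-mean rule

# Quantifier-guided weights, e.g. Q(r) = r^2, which emphasizes the smaller inputs
w = quantifier_weights(4, lambda r: r ** 2)
print(owa(scores, w))
```

The choice of $Q$ thus moves the operator continuously between the max rule ($Q$ jumping to 1 at 0) and the min rule ($Q$ reaching 1 only at 1), with the plain average in between.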