Journal of Machine Learning Research 3 (2003) 1265-1287. Submitted 5/02; Published 3/03.

A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification

Inderjit S. Dhillon (inderjit@cs.utexas.edu)
Subramanyam Mallela (manyam@cs.utexas.edu)
Rahul Kumar (rahul@cs.utexas.edu)
Department of Computer Sciences, University of Texas, Austin

Editors: Isabelle Guyon and André Elisseeff

Abstract

High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new information-theoretic divisive algorithm for feature/word clustering and apply it to text classification. Existing techniques for such "distributional clustering" of words are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies, our divisive algorithm is much faster and achieves comparable or higher classification accuracies. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Open Directory project (www.dmoz.org).
Keywords: Information Theory, Feature Clustering, Classification, Entropy, Kullback-Leibler Divergence, Mutual Information, Jensen-Shannon Divergence.

© 2003 Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar.

1. Introduction

Given a set of document vectors $\{d_1, d_2, \ldots, d_n\}$ and their associated class labels $c(d_i) \in \{c_1, c_2, \ldots, c_l\}$, text classification is the problem of estimating the true class label of a new document $d$. There exist a wide variety of algorithms for text classification, ranging from the simple but effective Naive Bayes algorithm to the more computationally demanding Support Vector Machines (Mitchell, 1997, Vapnik, 1995, Yang and Liu, 1999).

A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector-space or bag-of-words model (Salton and McGill, 1983). Even a moderately sized document collection can lead to a dimensionality in the thousands. For example, one of our test data sets contains 5,000 web pages from www.dmoz.org and has a dimensionality (vocabulary size after pruning) of 14,538. This high dimensionality can be a severe obstacle for classification algorithms based on Support Vector Machines, Linear Discriminant Analysis, $k$-nearest neighbor, etc. The problem is compounded when the documents are arranged in a hierarchy of classes and a full-feature classifier is applied at each node of the hierarchy.

A way to reduce dimensionality is by the distributional clustering of words/features (Pereira et al., 1993, Baker and McCallum, 1998, Slonim and Tishby, 2001). Each word cluster can then be treated as a single feature, and thus dimensionality can be drastically reduced. As shown by Baker and McCallum (1998) and Slonim and Tishby (2001), such feature clustering is more effective than feature selection (Yang and Pedersen, 1997), especially at lower numbers of features.
Also, even when dimensionality is reduced by as much as two orders of magnitude, the resulting classification accuracy is similar to that of a full-feature classifier. Indeed, in some cases of small training sets and noisy features, word clustering can actually increase classification accuracy. However, the algorithms developed by both Baker and McCallum (1998) and Slonim and Tishby (2001) are agglomerative in nature, making a greedy move at every step, and thus yield sub-optimal word clusters at a high computational cost.

In this paper, we use an information-theoretic framework that is similar to Information Bottleneck (see Chapter 2, Problem 22 of Cover and Thomas, 1991, Tishby et al., 1999) to derive a global criterion that captures the optimality of word clustering (see Theorem 1). Our global criterion is based on the generalized Jensen-Shannon divergence (Lin, 1991) among multiple probability distributions. In order to find the best word clustering, i.e., the clustering that minimizes this objective function, we present a new divisive algorithm for clustering words. This algorithm is reminiscent of the $k$-means algorithm but uses Kullback-Leibler divergences (Kullback and Leibler, 1951) instead of squared Euclidean distances. We prove that our divisive algorithm monotonically decreases the objective function value. We also show that our algorithm minimizes "within-cluster divergence" and simultaneously maximizes "between-cluster divergence". Thus we find word clusters that are markedly better than those produced by the agglomerative algorithms of Baker and McCallum (1998) and Slonim and Tishby (2001). The increased quality of our word clusters translates to higher classification accuracies, especially at small feature sizes and small training sets.
We provide empirical evidence of all the above claims using Naive Bayes and Support Vector Machines on (a) the 20 Newsgroups data set, and (b) an HTML data set comprising 5,000 web pages arranged in a 3-level hierarchy from the Open Directory Project (www.dmoz.org).

We now give a brief outline of the paper. In Section 2, we discuss related work and contrast it with ours. In Section 3 we briefly review some useful concepts from information theory, such as Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence, while in Section 4 we review text classifiers based on Naive Bayes and Support Vector Machines. Section 5 poses the question of finding optimal word clusters in terms of preserving mutual information between two random variables. Section 5.1 gives the algorithm that directly minimizes the resulting objective function, which is based on KL-divergences, and presents some pleasing aspects of the algorithm, such as convergence and simultaneous maximization of "between-cluster JS-divergence". In Section 6 we present experimental results that highlight the benefits of our word clustering and the resulting increase in classification accuracy. Finally, we present our conclusions in Section 7.

A word about notation: upper-case letters such as $X$, $Y$, $C$, $W$ will denote random variables, while script upper-case letters such as $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{C}$, $\mathcal{W}$ denote sets. Individual set elements will often be denoted by lower-case letters such as $x$, $w$ or $x_i$, $w_t$. Probability distributions will be denoted by $p$, $q$, $p_1$, $p_2$, etc. when the random variable is obvious, or by $p(X)$, $p(C|w_t)$ to make the random variable explicit. We use logarithms to the base 2.

2. Related Work

Text classification has been extensively studied, especially since the emergence of the internet. Most algorithms are based on the bag-of-words model for text (Salton and McGill, 1983).
A simple but effective algorithm is the Naive Bayes method (Mitchell, 1997). For text classification, different variants of Naive Bayes have been used, but McCallum and Nigam (1998) showed that the variant based on the multinomial model leads to better results. Support Vector Machines have also been successfully used for text classification (Joachims, 1998, Dumais et al., 1998). For hierarchical text data, such as the topic hierarchies of Yahoo! (www.yahoo.com) and the Open Directory Project (www.dmoz.org), hierarchical classification has been studied by Koller and Sahami (1997), Chakrabarti et al. (1997), and Dumais and Chen (2000). For some more details, see Section 4.

To counter high dimensionality, various methods of feature selection have been proposed by Yang and Pedersen (1997), Koller and Sahami (1997), and Chakrabarti et al. (1997). Distributional clustering of words has proven to be more effective than feature selection in text classification, and was first proposed by Pereira, Tishby, and Lee (1993), where "soft" distributional clustering was used to cluster nouns according to their conditional verb distributions. Note that since our main goal is to reduce the number of features and the model size, we are only interested in "hard" clustering, where each word is represented by its unique word cluster. For text classification, Baker and McCallum (1998) used such hard clustering, while more recently, Slonim and Tishby (2001) have used the Information Bottleneck method for clustering words. Both Baker and McCallum (1998) and Slonim and Tishby (2001) use similar agglomerative clustering strategies that make a greedy move at every agglomeration, and show that the feature size can be aggressively reduced by such clustering without any noticeable loss in classification accuracy using Naive Bayes. Similar results have been reported for Support Vector Machines (Bekkerman et al., 2001).
To select the number of word clusters to be used for the classification task, Verbeek (2000) has applied the Minimum Description Length (MDL) principle (Rissanen, 1989) to the agglomerative algorithm of Slonim and Tishby (2001).

Two other dimensionality/feature reduction schemes are latent semantic indexing (LSI) (Deerwester et al., 1990) and its probabilistic version (Hofmann, 1999). Typically these methods have been applied in the unsupervised setting, and as shown by Baker and McCallum (1998), LSI results in lower classification accuracies than feature clustering.

We now list the main contributions of this paper and contrast them with earlier work. As our first contribution, we use an information-theoretic framework to derive a global objective function that explicitly captures the optimality of word clusters in terms of the generalized Jensen-Shannon divergence between multiple probability distributions. As our second contribution, we present a divisive algorithm that uses Kullback-Leibler divergence as the distance measure and explicitly minimizes the global objective function. This is in contrast to Slonim and Tishby (2001), who considered the merging of just two word clusters at every step and derived a local criterion based on the Jensen-Shannon divergence of two probability distributions. Their agglomerative algorithm, which is similar to the algorithm of Baker and McCallum (1998), greedily optimizes this merging criterion (see Section 5.3 for more details). Thus, their resulting algorithm does not directly optimize a global criterion and is computationally expensive: the algorithm of Slonim and Tishby (2001) is $O(m^3 l)$ in complexity, where $m$ is the total number of words and $l$ is the number of classes. In contrast, the complexity of our divisive algorithm is $O(mkl\tau)$, where $k$ is the number of word clusters (typically $k \ll m$) and $\tau$ is the number of iterations (typically $\tau = 15$ on average).
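To make the flavor of such a divisive, $k$-means-style procedure concrete, here is a minimal Python sketch. It assumes each word $w_t$ is represented by its class distribution $p(C|w_t)$ together with a prior $p(w_t)$; the function names, random initialization, and empty-cluster handling are our illustrative choices, not the paper's exact pseudocode (which appears in Section 5.1):

```python
import math
import random

def kl(p, q):
    """KL(p || q) in bits; infinite if q(x) = 0 where p(x) > 0."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b <= 0:
                return float("inf")
            total += a * math.log2(a / b)
    return total

def divisive_cluster(word_dists, priors, k, iters=15, seed=0):
    """k-means-style divisive word clustering (illustrative sketch).

    word_dists[t] is the class distribution p(C | w_t) of word t, and
    priors[t] is p(w_t).  Each iteration (i) recomputes every cluster's
    distribution as the prior-weighted mean of its members, then
    (ii) reassigns each word to the cluster nearest in KL-divergence.
    """
    rng = random.Random(seed)
    m, l = len(word_dists), len(word_dists[0])
    assign = [rng.randrange(k) for _ in range(m)]
    for _ in range(iters):
        centers = []
        for j in range(k):
            members = [t for t in range(m) if assign[t] == j]
            if not members:                 # re-seed an empty cluster
                members = [rng.randrange(m)]
            w = sum(priors[t] for t in members)
            centers.append([sum(priors[t] * word_dists[t][c]
                                for t in members) / w for c in range(l)])
        new_assign = [min(range(k), key=lambda j, t=t: kl(word_dists[t], centers[j]))
                      for t in range(m)]
        if new_assign == assign:            # converged
            break
        assign = new_assign
    return assign
```

Each pass over the $m$ words computes $k$ KL-divergences of length $l$, which matches the $O(mkl\tau)$ complexity quoted above.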
Note that our hard clustering leads to a model size of $O(k)$, whereas "soft" clustering in methods such as probabilistic LSI (Hofmann, 1999) leads to a model size of $O(mk)$. Finally, we show that our enhanced word clustering leads to higher classification accuracy, especially when the training set is small and in hierarchical classification of HTML data.

3. Some Concepts from Information Theory

In this section, we quickly review some concepts from information theory which will be used heavily in this paper. For more details on some of this material, see the authoritative treatment in the book by Cover and Thomas (1991).

Let $X$ be a discrete random variable that takes on values from the set $\mathcal{X}$ with probability distribution $p(x)$. The entropy of $X$ (Shannon, 1948) is defined as
$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
The relative entropy or Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) between two probability distributions $p_1(x)$ and $p_2(x)$ is defined as
$$KL(p_1, p_2) = \sum_{x \in \mathcal{X}} p_1(x) \log \frac{p_1(x)}{p_2(x)}.$$
KL-divergence is a measure of the "distance" between two probability distributions; however, it is not a true metric since it is not symmetric and does not obey the triangle inequality (Cover and Thomas, 1991, p. 18). KL-divergence is always non-negative but can be unbounded; in particular, when $p_1(x) \neq 0$ and $p_2(x) = 0$, $KL(p_1, p_2) = \infty$. In contrast, the Jensen-Shannon (JS) divergence between $p_1$ and $p_2$, defined by
$$JS_\pi(p_1, p_2) = \pi_1 KL(p_1, \pi_1 p_1 + \pi_2 p_2) + \pi_2 KL(p_2, \pi_1 p_1 + \pi_2 p_2) = H(\pi_1 p_1 + \pi_2 p_2) - \pi_1 H(p_1) - \pi_2 H(p_2),$$
where $\pi_1 + \pi_2 = 1$, $\pi_i \geq 0$, is clearly a measure that is symmetric in $\{\pi_1, p_1\}$ and $\{\pi_2, p_2\}$, and is bounded (Lin, 1991).
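These definitions translate directly into code. The following small sketch represents probability distributions as lists over a common finite set and uses logarithms to base 2, matching the paper's convention:

```python
import math

def kl_divergence(p1, p2):
    """KL(p1, p2) = sum_x p1(x) log2(p1(x)/p2(x)).
    Non-negative and asymmetric; infinite when p1(x) > 0 but p2(x) = 0."""
    total = 0.0
    for a, b in zip(p1, p2):
        if a > 0:
            if b == 0:
                return float("inf")
            total += a * math.log2(a / b)
    return total

def js_divergence(p1, p2, pi1=0.5, pi2=0.5):
    """JS_pi(p1, p2) = pi1 KL(p1, m) + pi2 KL(p2, m), where
    m = pi1 p1 + pi2 p2 is the mixture; symmetric in {pi1, p1} and
    {pi2, p2}, and always bounded, unlike KL-divergence."""
    m = [pi1 * a + pi2 * b for a, b in zip(p1, p2)]
    return pi1 * kl_divergence(p1, m) + pi2 * kl_divergence(p2, m)
```

For example, `kl_divergence([1.0, 0.0], [0.5, 0.5])` is 1 bit, while reversing the arguments gives infinity, illustrating the asymmetry; the JS-divergence of the same pair is bounded in either order.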
The Jensen-Shannon divergence can be generalized to measure the distance between any finite number of probability distributions as
$$JS_\pi(\{p_i : 1 \leq i \leq n\}) = H\left(\sum_{i=1}^n \pi_i p_i\right) - \sum_{i=1}^n \pi_i H(p_i), \qquad (1)$$
which is symmetric in the $\{\pi_i, p_i\}$'s ($\sum_i \pi_i = 1$, $\pi_i \geq 0$).

Let $Y$ be another random variable with probability distribution $p(y)$. The mutual information between $X$ and $Y$, $I(X;Y)$, is defined as the KL-divergence between the joint probability distribution $p(x,y)$ and the product distribution $p(x)p(y)$:
$$I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)}. \qquad (2)$$
Intuitively, mutual information is a measure of the amount of information that one random variable contains about the other. The higher its value, the less is the uncertainty of one random variable due to knowledge of the other. Formally, it can be shown that $I(X;Y)$ is the reduction in the entropy of one variable given knowledge of the other: $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$ (Cover and Thomas, 1991).

4. Text Classification

Two contrasting classifiers that perform well on text classification are (i) the simple Naive Bayes method and (ii) the more complex Support Vector Machines.

4.1 Naive Bayes Classifier

Let $\mathcal{C} = \{c_1, c_2, \ldots, c_l\}$ be the set of $l$ classes, and let $\mathcal{W} = \{w_1, \ldots, w_m\}$ be the set of words/features contained in these classes. Given a new document $d$, the probability that $d$ belongs to class $c_i$ is given by Bayes rule:
$$p(c_i | d) = \frac{p(d | c_i)\, p(c_i)}{p(d)}.$$
Assuming a generative multinomial model (McCallum and Nigam, 1998) and further assuming class-conditional independence of words yields the well-known Naive Bayes classifier (Mitchell, 1997), which computes the most probable class for $d$ as
$$c^*(d) = \arg\max_{c_i} p(c_i | d) = \arg\max_{c_i} p(c_i) \prod_{t=1}^m p(w_t | c_i)^{n(w_t, d)}, \qquad (3)$$
where $n(w_t, d)$ is the number of occurrences of word $w_t$ in document $d$, and the quantities $p(w_t | c_i)$ are usually estimated using Laplace's rule of succession:
$$p(w_t | c_i) = \frac{1 + \sum_{d_j \in c_i} n(w_t, d_j)}{m + \sum_{t=1}^m \sum_{d_j \in c_i} n(w_t, d_j)}. \qquad (4)$$
The class priors $p(c_i)$ are estimated by the maximum likelihood estimate $p(c_i) = \frac{|c_i|}{\sum_j |c_j|}$.

We now manipulate the Naive Bayes rule in order to interpret it in an information-theoretic framework. Rewrite formula (3) by taking logarithms and dividing by the length of the document $|d|$ to get
$$c^*(d) = \arg\max_{c_i} \left( \frac{\log p(c_i)}{|d|} + \sum_{t=1}^m p(w_t | d) \log p(w_t | c_i) \right), \qquad (5)$$
where the document $d$ may be viewed as a probability distribution over words: $p(w_t | d) = n(w_t, d)/|d|$. Adding the entropy of $p(W | d)$, i.e., $-\sum_{t=1}^m p(w_t | d) \log p(w_t | d)$, to (5) and negating, we get
$$c^*(d) = \arg\min_{c_i} \left( \sum_{t=1}^m p(w_t | d) \log \frac{p(w_t | d)}{p(w_t | c_i)} - \frac{\log p(c_i)}{|d|} \right) \qquad (6)$$
$$= \arg\min_{c_i} \left( KL(p(W | d), p(W | c_i)) - \frac{\log p(c_i)}{|d|} \right),$$
where $KL(p, q)$ denotes the KL-divergence between $p$ and $q$ as defined in Section 3. Note that here we have used $W$ to denote the random variable that takes values from the set of words $\mathcal{W}$. Thus, assuming equal class priors, we see that Naive Bayes may be interpreted as finding the class distribution which has minimum KL-divergence from the given document.
As we shall see again later, KL-divergence seems to appear "naturally" in our setting. By (5), we can clearly see that Naive Bayes is a linear classifier. Despite its crude assumption about the class-conditional independence of words, Naive Bayes has been found to yield surprisingly good classification performance, especially on text data. Plausible reasons for the success of Naive Bayes have been explored by Domingos and Pazzani (1997), Friedman (1997).
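Equations (3) and (4) are straightforward to realize in code. The sketch below uses a hypothetical toy interface (documents as lists of word tokens) of our own devising; it estimates $p(w_t|c_i)$ with the Laplace rule of (4) and scores classes by the logarithm of (3), which is equivalent to (5) up to the scaling by $|d|$:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model.  Per equation (4),
    p(w_t | c_i) = (1 + count of w_t in c_i) / (m + total count in c_i),
    where m is the vocabulary size; priors are relative class frequencies."""
    vocab = sorted({w for d in docs for w in d})
    classes = sorted(set(labels))
    m = len(vocab)
    counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        counts[c].update(d)
    cond = {c: {w: (1 + counts[c][w]) / (m + sum(counts[c].values()))
                for w in vocab} for c in classes}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    return vocab, classes, prior, cond

def classify(doc, vocab, classes, prior, cond):
    """Most probable class: the log of equation (3),
    argmax_c  log p(c) + sum_t n(w_t, d) log p(w_t | c)."""
    n = Counter(w for w in doc if w in vocab)   # out-of-vocabulary words dropped
    def score(c):
        return math.log2(prior[c]) + sum(k * math.log2(cond[c][w])
                                         for w, k in n.items())
    return max(classes, key=score)
```

Because the scoring is a sum of per-word log terms, this also makes the linearity of the classifier noted above directly visible.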