Journal of Machine Learning Research 3 (2003) 1265–1287. Submitted 5/02; Published 3/03.
A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification
Inderjit S. Dhillon  INDERJIT@CS.UTEXAS.EDU
Subramanyam Mallela  MANYAM@CS.UTEXAS.EDU
Rahul Kumar  RAHUL@CS.UTEXAS.EDU
Department of Computer Sciences, University of Texas, Austin
Editors: Isabelle Guyon and André Elisseeff
Abstract
High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new information-theoretic divisive algorithm for feature/word clustering and apply it to text classification. Existing techniques for such "distributional clustering" of words are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies our divisive algorithm is much faster and achieves comparable or higher classification accuracies. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Open Directory project (www.dmoz.org).

Keywords: information theory, feature clustering, classification, entropy, Kullback-Leibler divergence, mutual information, Jensen-Shannon divergence
1. Introduction
Given a set of document vectors {d_1, d_2, …, d_n} and their associated class labels c(d_i) ∈ {c_1, c_2, …, c_l}, text classification is the problem of estimating the true class label of a new document d. There exist a wide variety of algorithms for text classification, ranging from the simple but effective Naive Bayes algorithm to the more computationally demanding Support Vector Machines (Mitchell, 1997, Vapnik, 1995, Yang and Liu, 1999).

A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector-space or bag-of-words model (Salton and McGill, 1983). Even a moderately sized document collection can lead to a dimensionality in the thousands. For example, one of our test data sets contains 5,000 web pages from
www.dmoz.org
and has a dimensionality (vocabulary size after pruning) of 14,538. This high dimensionality can be a severe obstacle for classification algorithms based on Support Vector Machines, Linear Discriminant Analysis, k-nearest neighbor, etc. The problem is compounded when the documents are arranged in a hierarchy of classes and a full-feature classifier is applied at each node of the hierarchy.

©2003 Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar.

A way to reduce dimensionality is by the distributional clustering of words/features (Pereira et al., 1993, Baker and McCallum, 1998, Slonim and Tishby, 2001). Each word cluster can then be treated as a single feature and thus dimensionality can be drastically reduced. As shown by Baker and McCallum (1998), Slonim and Tishby (2001), such feature clustering is more effective than feature selection (Yang and Pedersen, 1997), especially at lower numbers of features. Also, even when dimensionality is reduced by as much as two orders of magnitude the resulting classification accuracy is similar to that of a full-feature classifier. Indeed, in some cases of small training sets and noisy features, word clustering can actually increase classification accuracy. However, the algorithms developed by both Baker and McCallum (1998) and Slonim and Tishby (2001) are agglomerative in nature, making a greedy move at every step, and thus yield sub-optimal word clusters at a high computational cost.

In this paper, we use an information-theoretic framework that is similar to Information Bottleneck (see Chapter 2, Problem 22 of Cover and Thomas, 1991, Tishby et al., 1999) to derive a global criterion that captures the optimality of word clustering (see Theorem 1). Our global criterion is based on the generalized Jensen-Shannon divergence (Lin, 1991) among multiple probability distributions. In order to find the best word clustering, i.e., the clustering that minimizes this objective function, we present a new divisive algorithm for clustering words. This algorithm is reminiscent of the k-means algorithm but uses Kullback-Leibler divergences (Kullback and Leibler, 1951) instead of squared Euclidean distances. We prove that our divisive algorithm monotonically decreases the objective function value. We also show that our algorithm minimizes "within-cluster divergence" and simultaneously maximizes "between-cluster divergence". Thus we find word clusters that are markedly better than those produced by the agglomerative algorithms of Baker and McCallum (1998) and Slonim and Tishby (2001). The increased quality of our word clusters translates to higher classification accuracies, especially at small feature sizes and small training sets. We provide empirical evidence of all the above claims using Naive Bayes and Support Vector Machines on (a) the 20 Newsgroups data set, and (b) an HTML data set comprising 5,000 web pages arranged in a 3-level hierarchy from the Open Directory Project (
www.dmoz.org
).

We now give a brief outline of the paper. In Section 2, we discuss related work and contrast it with our work. In Section 3 we briefly review some useful concepts from information theory, such as Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence, while in Section 4 we review text classifiers based on Naive Bayes and Support Vector Machines. Section 5 poses the question of finding optimal word clusters in terms of preserving mutual information between two random variables. Section 5.1 gives the algorithm that directly minimizes the resulting objective function, which is based on KL-divergences, and presents some pleasing aspects of the algorithm, such as convergence and simultaneous maximization of "between-cluster JS-divergence". In Section 6 we present experimental results that highlight the benefits of our word clustering and the resulting increase in classification accuracy. Finally, we present our conclusions in Section 7.

A word about notation: upper-case letters such as X, Y, C, W will denote random variables, while script upper-case letters such as 𝒳, 𝒴, 𝒞, 𝒲 denote sets. Individual set elements will often be denoted by lower-case letters such as x, w or x_i, w_t. Probability distributions will be denoted by p, q, p_1, p_2, etc. when the random variable is obvious, or by p(X), p(C|w_t) to make the random variable explicit. We use logarithms to base 2.
2. Related Work
Text classification has been extensively studied, especially since the emergence of the internet. Most algorithms are based on the bag-of-words model for text (Salton and McGill, 1983). A simple but effective algorithm is the Naive Bayes method (Mitchell, 1997). For text classification, different variants of Naive Bayes have been used, but McCallum and Nigam (1998) showed that the variant based on the multinomial model leads to better results. Support Vector Machines have also been successfully used for text classification (Joachims, 1998, Dumais et al., 1998). For hierarchical text data, such as the topic hierarchies of Yahoo! (www.yahoo.com) and the Open Directory Project (www.dmoz.org), hierarchical classification has been studied by Koller and Sahami (1997), Chakrabarti et al. (1997), Dumais and Chen (2000). For some more details, see Section 4.

To counter high dimensionality, various methods of feature selection have been proposed by Yang and Pedersen (1997), Koller and Sahami (1997) and Chakrabarti et al. (1997). Distributional clustering of words has proven to be more effective than feature selection in text classification and was first proposed by Pereira, Tishby, and Lee (1993), where "soft" distributional clustering was used to cluster nouns according to their conditional verb distributions. Note that since our main goal is to reduce the number of features and the model size, we are only interested in "hard clustering", where each word can be represented by its unique word cluster. For text classification, Baker and McCallum (1998) used such hard clustering, while more recently, Slonim and Tishby (2001) have used the Information Bottleneck method for clustering words. Both Baker and McCallum (1998) and Slonim and Tishby (2001) use similar agglomerative clustering strategies that make a greedy move at every agglomeration, and show that feature size can be aggressively reduced by such clustering without any noticeable loss in classification accuracy using Naive Bayes. Similar results have been reported for Support Vector Machines (Bekkerman et al., 2001). To select the number of word clusters to be used for the classification task, Verbeek (2000) has applied the Minimum Description Length (MDL) principle (Rissanen, 1989) to the agglomerative algorithm of Slonim and Tishby (2001).

Two other dimensionality/feature reduction schemes are latent semantic indexing (LSI) (Deerwester et al., 1990) and its probabilistic version (Hofmann, 1999). Typically these methods have been applied in the unsupervised setting and, as shown by Baker and McCallum (1998), LSI results in lower classification accuracies than feature clustering.

We now list the main contributions of this paper and contrast them with earlier work. As our first contribution, we use an information-theoretic framework to derive a global objective function that explicitly captures the optimality of word clusters in terms of the generalized Jensen-Shannon divergence between multiple probability distributions. As our second contribution, we present a divisive algorithm that uses Kullback-Leibler divergence as the distance measure and explicitly minimizes the global objective function. This is in contrast to Slonim and Tishby (2001), who considered the merging of just two word clusters at every step and derived a local criterion based on the Jensen-Shannon divergence of two probability distributions. Their agglomerative algorithm, which is similar to the algorithm of Baker and McCallum (1998), greedily optimizes this merging criterion (see Section 5.3 for more details). Thus, their resulting algorithm does not directly optimize a global criterion and is computationally expensive: the algorithm of Slonim and Tishby (2001) is
O(m³l) in complexity, where m is the total number of words and l is the number of classes. In contrast, the complexity of our divisive algorithm is O(mklτ), where k is the number of word clusters (typically k ≪ m) and τ is the number of iterations (typically τ = 15 on average). Note
that our hard clustering leads to a model size of O(k), whereas "soft" clustering in methods such as probabilistic LSI (Hofmann, 1999) leads to a model size of O(mk). Finally, we show that our enhanced word clustering leads to higher classification accuracy, especially when the training set is small and in hierarchical classification of HTML data.
3. Some Concepts from Information Theory
In this section, we quickly review some concepts from information theory which will be used heavily in this paper. For more details on some of this material, see the authoritative treatment in the book by Cover and Thomas (1991).

Let X be a discrete random variable that takes on values from the set 𝒳 with probability distribution p(x). The entropy of X (Shannon, 1948) is defined as
$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
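As a concrete illustration (our own, not from the paper), the definition above can be computed directly; the function name and the example distributions below are invented, and base-2 logarithms are used as in the text:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))   # a fair coin: 1.0 bit
print(entropy([0.25] * 4))   # a uniform 4-way choice: 2.0 bits
```

A peaked distribution yields lower entropy than a uniform one over the same support, reflecting lower uncertainty.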
The relative entropy or Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) between two probability distributions p_1(x) and p_2(x) is defined as
$$KL(p_1, p_2) = \sum_{x \in \mathcal{X}} p_1(x) \log \frac{p_1(x)}{p_2(x)}.$$
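The KL-divergence can be computed in the same style. The sketch below (again our own, with invented distributions) illustrates two properties noted in the text: asymmetry, and unboundedness when p_2(x) = 0 while p_1(x) > 0:

```python
import math

def kl(p1, p2):
    """KL(p1, p2) = sum_x p1(x) log2(p1(x) / p2(x)).

    Terms with p1(x) = 0 are skipped; if p1(x) > 0 where p2(x) = 0,
    the divergence is taken to be infinite.
    """
    if any(a > 0 and b == 0 for a, b in zip(p1, p2)):
        return math.inf
    return sum(a * math.log2(a / b) for a, b in zip(p1, p2) if a > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q), kl(q, p))          # unequal: KL is not symmetric
print(kl([0.5, 0.5], [1.0, 0.0]))  # inf: KL can be unbounded
```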
KL-divergence is a measure of the "distance" between two probability distributions; however, it is not a true metric since it is not symmetric and does not obey the triangle inequality (Cover and Thomas, 1991, p. 18). KL-divergence is always non-negative but can be unbounded; in particular, when p_1(x) ≠ 0 and p_2(x) = 0, KL(p_1, p_2) = ∞. In contrast, the Jensen-Shannon (JS) divergence between p_1 and p_2, defined by
$$JS_\pi(p_1, p_2) = \pi_1 KL(p_1, \pi_1 p_1 + \pi_2 p_2) + \pi_2 KL(p_2, \pi_1 p_1 + \pi_2 p_2) = H(\pi_1 p_1 + \pi_2 p_2) - \pi_1 H(p_1) - \pi_2 H(p_2),$$
where π_1 + π_2 = 1, π_i ≥ 0, is clearly a measure that is symmetric in {π_1, p_1} and {π_2, p_2}, and is bounded (Lin, 1991). The Jensen-Shannon divergence can be generalized to measure the distance between any finite number of probability distributions as:
$$JS_\pi(\{p_i : 1 \le i \le n\}) = H\!\left(\sum_{i=1}^{n} \pi_i p_i\right) - \sum_{i=1}^{n} \pi_i H(p_i), \qquad (1)$$

which is symmetric in the {π_i, p_i}'s (∑_i π_i = 1, π_i ≥ 0).

Let Y be another random variable with probability distribution p(y). The mutual information between X and Y, I(X;Y), is defined as the KL-divergence between the joint probability distribution p(x, y) and the product distribution p(x)p(y):
$$I(X;Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \qquad (2)$$

Intuitively, mutual information is a measure of the amount of information that one random variable contains about the other. The higher its value, the less is the uncertainty of one random variable due to knowledge about the other. Formally, it can be shown that I(X;Y) is the reduction in the entropy of one variable given knowledge of the other: I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) (Cover and Thomas, 1991).
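Definitions (1) and (2) connect neatly: I(X;Y) equals the generalized JS divergence of the conditional distributions p(Y|x), weighted by π_x = p(x), since I(X;Y) = H(Y) − H(Y|X). The sketch below checks this numerically; the joint distribution and all function names are our own invention:

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

def mutual_information(joint):
    """I(X;Y) per Eq. (2), from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

def js(dists, weights):
    """Generalized JS divergence, Eq. (1): H(sum_i pi_i p_i) - sum_i pi_i H(p_i)."""
    mean = [sum(w * p[k] for w, p in zip(weights, dists))
            for k in range(len(dists[0]))]
    return entropy(mean) - sum(w * entropy(p) for w, p in zip(weights, dists))

joint = [[0.4, 0.1],   # toy joint distribution p(x, y)
         [0.2, 0.3]]
px = [sum(row) for row in joint]
conds = [[v / sum(row) for v in row] for row in joint]  # conditionals p(Y | x)

# I(X;Y) coincides with the p(x)-weighted JS divergence of the p(Y|x)'s.
print(mutual_information(joint), js(conds, px))  # the two values agree
```

This identity is exactly what makes the JS divergence a natural tool for measuring the mutual information lost when distributions are merged.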
4. Text Classification
Two contrasting classifiers that perform well on text classification are (i) the simple Naive Bayes method and (ii) the more complex Support Vector Machines.
4.1 Naive Bayes Classifier
Let 𝒞 = {c_1, c_2, …, c_l} be the set of l classes, and let 𝒲 = {w_1, …, w_m} be the set of words/features contained in these classes. Given a new document d, the probability that d belongs to class c_i is given by Bayes rule:

$$p(c_i \mid d) = \frac{p(d \mid c_i)\, p(c_i)}{p(d)}.$$
Assuming a generative multinomial model (McCallum and Nigam, 1998) and further assuming class-conditional independence of words yields the well-known Naive Bayes classifier (Mitchell, 1997), which computes the most probable class for d as

$$c^*(d) = \arg\max_{c_i} p(c_i \mid d) = \arg\max_{c_i}\, p(c_i) \prod_{t=1}^{m} p(w_t \mid c_i)^{n(w_t, d)}, \qquad (3)$$

where n(w_t, d) is the number of occurrences of word w_t in document d, and the quantities p(w_t | c_i) are usually estimated using Laplace's rule of succession:
$$p(w_t \mid c_i) = \frac{1 + \sum_{d_j \in c_i} n(w_t, d_j)}{m + \sum_{t=1}^{m} \sum_{d_j \in c_i} n(w_t, d_j)}. \qquad (4)$$

The class priors p(c_i) are estimated by the maximum likelihood estimate p(c_i) = |c_i| / ∑_j |c_j|.

We now manipulate the Naive Bayes rule in order to interpret it in an information-theoretic framework. Rewrite formula (3) by taking logarithms and dividing by the length of the document |d| to get
$$c^*(d) = \arg\max_{c_i} \left( \frac{\log p(c_i)}{|d|} + \sum_{t=1}^{m} p(w_t \mid d) \log p(w_t \mid c_i) \right), \qquad (5)$$

where the document d may be viewed as a probability distribution over words: p(w_t | d) = n(w_t, d)/|d|. Adding the entropy of p(W | d), i.e., −∑_{t=1}^{m} p(w_t | d) log p(w_t | d), to (5), and negating, we get
$$c^*(d) = \arg\min_{c_i} \left( \sum_{t=1}^{m} p(w_t \mid d) \log \frac{p(w_t \mid d)}{p(w_t \mid c_i)} - \frac{\log p(c_i)}{|d|} \right) \qquad (6)$$

$$\phantom{c^*(d)} = \arg\min_{c_i} \left( KL(p(W \mid d),\, p(W \mid c_i)) - \frac{\log p(c_i)}{|d|} \right),$$
where KL(p, q) denotes the KL-divergence between p and q as defined in Section 3. Note that here we have used W to denote the random variable that takes values from the set of words 𝒲. Thus, assuming equal class priors, we see that Naive Bayes may be interpreted as finding the class distribution which has minimum KL-divergence from the given document. As we shall see again later, KL-divergence seems to appear "naturally" in our setting.

By (5), we can clearly see that Naive Bayes is a linear classifier. Despite its crude assumption about the class-conditional independence of words, Naive Bayes has been found to yield surprisingly good classification performance, especially on text data. Plausible reasons for the success of Naive Bayes have been explored by Domingos and Pazzani (1997) and Friedman (1997).
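To make the equivalence concrete, here is a small sketch (entirely our own: the class names, the smoothed class-conditional distributions standing in for estimates from (4), and the word counts are all invented) showing that, with equal class priors, maximizing the log-likelihood form (5) and minimizing the KL-divergence form (6) select the same class:

```python
import math

def kl(p, q):
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

# Toy class-conditional word distributions p(w_t | c_i), assumed already
# Laplace-smoothed so that every entry is strictly positive.
classes = {
    "sports":   [0.5, 0.3, 0.1, 0.1],
    "politics": [0.1, 0.1, 0.4, 0.4],
}

# A document as word counts n(w_t, d), viewed as p(w_t | d) = n(w_t, d) / |d|.
counts = [6, 3, 1, 0]
total = sum(counts)
p_d = [n / total for n in counts]

# With equal priors, Naive Bayes picks the class maximizing
# sum_t p(w_t|d) log p(w_t|c_i), equivalently minimizing KL(p(W|d), p(W|c_i)).
by_loglik = max(classes, key=lambda c: sum(
    pw * math.log2(pc) for pw, pc in zip(p_d, classes[c]) if pw > 0))
by_kl = min(classes, key=lambda c: kl(p_d, classes[c]))
print(by_loglik, by_kl)  # the two rules agree
```

The two rules differ only by the document's own entropy, which is constant across classes and therefore cannot change the argmin.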