Distributional word clusters vs. words for text categorization

Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, Yoad Winter

Research output: Contribution to journal › Article › peer-review


We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
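As a rough illustration of the difference between the two document representations compared in the abstract, the sketch below maps a tokenized document to bag-of-words counts and to word-cluster counts. It makes several assumptions: the `word_to_cluster` mapping is hand-made and hypothetical, the function names are invented for this example, and words outside the mapping are simply skipped. It does not implement the Information Bottleneck clustering or the SVM classifier used in the paper.

```python
from collections import Counter


def bow_counts(tokens):
    """Bag-of-words: one feature per distinct word."""
    return Counter(tokens)


def cluster_counts(tokens, word_to_cluster):
    """Word-cluster representation: one feature per cluster.

    Words missing from the mapping are skipped (an assumption of this sketch).
    """
    return Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)


# Hypothetical hand-made clustering. The paper derives its clusters with the
# Information Bottleneck method, which is NOT implemented here.
word_to_cluster = {
    "goal": 0, "striker": 0, "match": 0,
    "cpu": 1, "kernel": 1, "driver": 1,
}

doc = ["striker", "goal", "goal", "kernel", "referee"]
bow = bow_counts(doc)                            # one dimension per word
clusters = cluster_counts(doc, word_to_cluster)  # one dimension per cluster
```

In this toy setup the document has four distinct word features but only two cluster features, which is the sense in which a cluster-based representation can be more compact. Either count vector could then be passed to a classifier.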

Original language: English
Pages (from-to): 1183-1208
Number of pages: 26
Journal: Journal of Machine Learning Research
State: Published - Mar 2003
Externally published: Yes


ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence

