Distributional word clusters vs. words for text categorization

Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, Yoad Winter

Research output: Contribution to journal › Article › peer-review

Abstract

We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
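The difference between the two representations compared in the abstract can be illustrated with a minimal sketch. It assumes a hard word-to-cluster assignment is already available; the `word_to_cluster` map, the example sentence, and the function names below are hypothetical, and neither the Information Bottleneck clustering nor the SVM training is shown.

```python
from collections import Counter


def bow_vector(tokens):
    """Bag-of-words: one count per distinct word."""
    return Counter(tokens)


def cluster_vector(tokens, word_to_cluster):
    """Word-cluster representation: one count per cluster.

    Words absent from the (hypothetical) cluster map are skipped.
    """
    counts = Counter()
    for tok in tokens:
        cluster = word_to_cluster.get(tok)
        if cluster is not None:
            counts[cluster] += 1
    return counts


# Hypothetical hard assignment of words to clusters; in the paper the
# clusters come from the Information Bottleneck method, not a hand-made map.
word_to_cluster = {
    "goalie": 0, "puck": 0, "hockey": 0,
    "orbit": 1, "shuttle": 1, "nasa": 1,
}

doc = "the goalie stopped the puck as the shuttle reached orbit".split()
bow = bow_vector(doc)                            # one feature per distinct word
clusters = cluster_vector(doc, word_to_cluster)  # one feature per cluster
```

Under these assumptions the document is described by two cluster counts instead of eight word counts, which is the sense in which the cluster representation is more compact. Either vector could then be passed to a classifier.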

Original language: English
Pages (from-to): 1183-1208
Number of pages: 26
Journal: Journal of Machine Learning Research
Volume: 3
State: Published - Mar 2003
Externally published: Yes

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
