Abstract
We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. The word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three well-known datasets. On one of these datasets (20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the other two datasets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
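To make the contrast between the two representations concrete, here is a minimal sketch in Python. It assumes a hand-made hard assignment of words to clusters (`WORD_TO_CLUSTER`, a hypothetical name); the paper derives its clusters with the Information Bottleneck method, which is not implemented here, and the SVM training step is omitted. The function names are illustrative only.

```python
from collections import Counter

# Hypothetical hard assignment of words to clusters, hand-made for illustration.
# The paper computes clusters with the Information Bottleneck method instead.
WORD_TO_CLUSTER = {
    "goal": 0, "match": 0, "team": 0,
    "cpu": 1, "disk": 1, "memory": 1,
}
NUM_CLUSTERS = 2


def bow_vector(tokens, vocabulary):
    """Bag-of-words: one count per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cluster_vector(tokens, word_to_cluster, num_clusters):
    """Word-cluster representation: one count per cluster.

    Tokens without a cluster assignment are skipped (an assumption of this sketch).
    """
    vec = [0] * num_clusters
    for token in tokens:
        cluster = word_to_cluster.get(token)
        if cluster is not None:
            vec[cluster] += 1
    return vec


doc = "team match goal goal cpu".split()
vocab = sorted(WORD_TO_CLUSTER)

bow = bow_vector(doc, vocab)                                   # one dimension per word
clusters = cluster_vector(doc, WORD_TO_CLUSTER, NUM_CLUSTERS)  # one dimension per cluster
```

Either vector could then be passed to an SVM; the cluster vector has far fewer dimensions, which is the sense in which the abstract calls the representation compact.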
Original language | English |
---|---|
Pages (from-to) | 1183-1208 |
Number of pages | 26 |
Journal | Journal of Machine Learning Research |
Volume | 3 |
State | Published - Mar 2003 |
Externally published | Yes |
ASJC Scopus subject areas
- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence