Interactive clustering of text collections according to a user-specified criterion

Ron Bekkerman, Hema Raghavan, James Allan, Koji Eguchi

Research output: Contribution to journal, Conference article, peer-review

Abstract

Document clustering is traditionally tackled from the perspective of grouping documents that are topically similar. However, many other criteria for clustering documents can be considered: for example, documents' genre or the author's mood. We propose an interactive scheme for clustering document collections, based on any criterion of the user's preference. The user holds an active position in the clustering process: first, she chooses the types of features suitable to the underlying task, leading to a task-specific document representation. She can then provide examples of features - if such examples are emerging, e.g., when clustering by the author's sentiment, words like 'perfect', 'mediocre', 'awful' are intuitively good features. The algorithm proceeds iteratively, and the user can fix errors made by the clustering system at the end of each iteration. Such an interactive clustering method demonstrates excellent results on clustering by sentiment, substantially outperforming an SVM trained on a large amount of labeled data. Even if features are not provided because they are not intuitively obvious to the user - e.g., what would be good features for clustering by genre using part-of-speech trigrams? - our multi-modal clustering method performs significantly better than k-means and Latent Dirichlet Allocation (LDA).

Original language: English
Pages (from-to): 684-689
Number of pages: 6
Journal: IJCAI International Joint Conference on Artificial Intelligence
State: Published - 2007
Externally published: Yes
Event: 20th International Joint Conference on Artificial Intelligence, IJCAI 2007 - Hyderabad, India
Duration: 6 Jan 2007 to 12 Jan 2007

ASJC Scopus subject areas

  • Artificial Intelligence
