Data weaving: Scaling up the state-of-the-art in data clustering

Ron Bekkerman, Martin Scholz

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable. This paper is concerned with information-theoretic clustering (ITC) that has historically been considered the state-of-the-art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. Those few ITC methods that scale well (using, e.g., parallelization) are often out-performed by the others, of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving - a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal - it allows simultaneous clustering of a few types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which results in proposing a powerful DataLoom algorithm. In our experimentation with small datasets, DataLoom shows practically identical performance compared to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate the scalability, we simultaneously clustered rows and columns of a contingency table with over 120 billion entries.

Original languageEnglish
Title of host publicationProceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
Pages1083-1092
Number of pages10
DOIs
StatePublished - 2008
Externally publishedYes
Event17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States
Duration: 26 Oct 200830 Oct 2008

Publication series

NameInternational Conference on Information and Knowledge Management, Proceedings

Conference

Conference17th ACM Conference on Information and Knowledge Management, CIKM'08
Country/TerritoryUnited States
CityNapa Valley, CA
Period26/10/0830/10/08

Keywords

  • Information-theoretic clustering
  • Multi-modal clustering
  • Parallel and distributed data mining

ASJC Scopus subject areas

  • General Decision Sciences
  • General Business, Management and Accounting

Fingerprint

Dive into the research topics of 'Data weaving: Scaling up the state-of-the-art in data clustering'. Together they form a unique fingerprint.

Cite this