Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

Dan Feldman, Melanie Schmidt, Christian Sohler

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in ℝ^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the first O(k/ε²) right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^{O(jk)} for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d ∼ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that is polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size k^{1/ε^{O(1)}} for k-means and a special class of Bregman divergences that is less dependent on the properties of the squared Euclidean distance.
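
The first claim of the abstract can be illustrated numerically. The sketch below is not the authors' implementation; it assumes NumPy, synthetic clustered data, and hypothetical helper names (`low_rank_sketch`, `kmeans_cost`). The rank `m = 20` is an arbitrary stand-in for the O(k/ε²) bound, whose constants the abstract does not give. It compares the k-means cost of a fixed set of centers on A with the cost on the rank-m approximation A_m plus the constant ‖A − A_m‖_F².

```python
import numpy as np

def low_rank_sketch(A, m):
    """Return the rank-m approximation A_m of A and the constant ||A - A_m||_F^2."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_m = (U[:, :m] * s[:m]) @ Vt[:m]      # projection onto the top m right singular vectors
    tail = float(np.sum(s[m:] ** 2))       # squared Frobenius norm of the discarded part
    return A_m, tail

def kmeans_cost(P, centers):
    """Sum over the rows of P of the squared distance to the nearest center."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Synthetic data (an assumption for illustration): k noisy clusters in d dimensions.
rng = np.random.default_rng(0)
n, d, k, m = 200, 50, 3, 20                # m is a hypothetical choice, not derived from epsilon
means = 5.0 * rng.normal(size=(k, d))
labels = rng.integers(0, k, size=n)
A = means[labels] + rng.normal(size=(n, d))

A_m, tail = low_rank_sketch(A, m)

# Query: an arbitrary set of k centers (here, perturbed cluster means).
centers = means + 0.5 * rng.normal(size=(k, d))
exact = kmeans_cost(A, centers)
approx = kmeans_cost(A_m, centers) + tail
rel_err = abs(exact - approx) / exact
print(f"exact={exact:.2f} approx={approx:.2f} relative error={rel_err:.4f}")
```

When every center lies in the span of the top m right singular vectors, the two costs agree exactly by the Pythagorean theorem. The result stated in the abstract is that for arbitrary queries the gap stays within a (1 + ε)-factor once the rank is O(k/ε²).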

Original language: English
Title of host publication: Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013
Publisher: Association for Computing Machinery
Pages: 1434-1453
Number of pages: 20
ISBN (Print): 9781611972511
DOIs
State: Published - 2013
Externally published: Yes
Event: 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013 - New Orleans, LA, United States
Duration: 6 Jan 2013 → 8 Jan 2013

Publication series

Name: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms

Conference

Conference: 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013
Country/Territory: United States
City: New Orleans, LA
Period: 6/01/13 → 8/01/13

ASJC Scopus subject areas

  • Software
  • General Mathematics
