Turning Big Data Into Tiny Data: Coresets for Unsupervised Learning Problems

Research output: Contribution to journal › Article › peer-review

Abstract

We develop and analyze a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space $\mathbb{R}^d$ to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first k principal components of the reduced set will return approximately the first k principal components of the original set, or computing the centers of a k-means clustering on the reduced set will return an approximation for the original set. Such a reduced set is also known as a coreset. The main new features of our construction are that the cardinality of the reduced set is independent of the dimension d of the input space and that the sets are mergeable [P. K. Agarwal et al., Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2012, pp. 23–34]. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets. It allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as k-means and subspace approximation the coreset sizes are also independent of the number of input points. Our method is based on projecting the points, in a data-dependent way, onto a low-dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques including k-means clustering, principal component analysis, and subspace clustering. The main conceptual contribution is a new coreset definition that allows the costs that appear in every solution to be charged to an additive constant.
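The idea of a coreset with an additive constant can be illustrated for the subspace approximation (PCA) case. The following is a minimal sketch, not the construction from the article: it assumes the standard truncated-SVD argument with m = k + ⌈k/ε⌉ − 1 retained directions, and the names `pca_coreset`, `subspace_cost`, and `merge` are hypothetical helpers introduced here for illustration. The reduced set consists of m points in $\mathbb{R}^d$ plus a constant Δ that collects the squared tail singular values, and merging two reduced sets amounts to stacking the points and adding the constants.

```python
import numpy as np

def pca_coreset(A, k, eps):
    """Return (C, delta): m points in R^d and an additive constant.

    Assumption (illustrative, not taken from the article):
    m = k + ceil(k / eps) - 1 singular directions are kept.
    """
    m = k + int(np.ceil(k / eps)) - 1
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    C = s[:m, None] * Vt[:m]            # rows of Sigma_m V_m^T
    delta = float(np.sum(s[m:] ** 2))   # cost charged to every solution
    return C, delta

def subspace_cost(P, B):
    """Sum of squared distances from rows of P to span(B), B orthonormal (d x k)."""
    residual = P - (P @ B) @ B.T
    return float(np.sum(residual ** 2))

def merge(coreset_a, coreset_b):
    """Union of two reduced sets: stack the points, add the constants."""
    (Ca, da), (Cb, db) = coreset_a, coreset_b
    return np.vstack([Ca, Cb]), da + db

# Small demonstration on synthetic data (fixed seed for reproducibility).
rng = np.random.default_rng(0)
A1 = rng.normal(size=(200, 30))
A2 = rng.normal(size=(300, 30))
k, eps = 2, 0.5

# Any k-dimensional candidate subspace, here a random one.
B, _ = np.linalg.qr(rng.normal(size=(30, k)))

C, delta = merge(pca_coreset(A1, k, eps), pca_coreset(A2, k, eps))
exact = subspace_cost(np.vstack([A1, A2]), B)
approx = subspace_cost(C, B) + delta
print(C.shape, exact, approx)
```

In this sketch the number of retained points depends only on k and ε, not on the number of input points or on d, and the estimate `subspace_cost(C, B) + delta` overestimates the true cost by at most a factor of 1 + ε for every candidate subspace B. Re-reducing the merged set, as a streaming variant would do, is omitted here.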

Original language: English
Pages (from-to): 801-861
Number of pages: 61
Journal: SIAM Review
Volume: 67
Issue number: 4
State: Published - 6 Nov 2025

Bibliographical note

Publisher Copyright:
© 2025 Society for Industrial and Applied Mathematics

Keywords

  • big data
  • coresets
  • k-means
  • PCA
  • projective clustering
  • streaming

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computational Mathematics
  • Applied Mathematics
