Abstract
We develop and analyze a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space R^d to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first k principal components of the reduced set will return approximately the first k principal components of the original set, or computing the centers of a k-means clustering on the reduced set will return an approximation for the original set. Such a reduced set is also known as a coreset. The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension d of the input space and that the sets are mergeable [P. K. Agarwal et al., Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2012, pp. 23-34]. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets. It allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as k-means and subspace approximation, the coreset sizes are also independent of the number of input points. Our method is based on data-dependently projecting the points onto a low-dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques including k-means clustering, principal component analysis, and subspace clustering. The main conceptual contribution is a new coreset definition that allows charging costs that appear for every solution to an additive constant.
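The projection step and the additive constant mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' construction: it only projects the points onto their top m right singular vectors and records the discarded squared mass as a constant, and it omits the subsequent cardinality reduction inside the subspace. The function names `project_with_constant` and `kmeans_cost`, the choice of m, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def project_with_constant(points, m):
    """Project the rows of `points` onto their top-m right singular vectors.

    Returns the projected points (still in R^d, rank at most m) and the
    additive constant delta: the squared Frobenius norm of what the
    projection discards. Illustrative helper, not the paper's algorithm.
    """
    u, s, vt = np.linalg.svd(points, full_matrices=False)
    projected = (u[:, :m] * s[:m]) @ vt[:m]
    delta = float(np.sum(s[m:] ** 2))
    return projected, delta


def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    diffs = points[:, None, :] - centers[None, :, :]
    return float((diffs ** 2).sum(axis=2).min(axis=1).sum())


# Synthetic data (assumption): 200 points in R^50, projected to m = 5.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))
m = 5
projected, delta = project_with_constant(data, m)

# Centers chosen inside the projection subspace, so the identity
# cost(data) = cost(projected) + delta holds exactly (Pythagoras).
_, _, vt = np.linalg.svd(data, full_matrices=False)
centers = rng.normal(size=(3, m)) @ vt[:m]

full_cost = kmeans_cost(data, centers)
reduced_cost = kmeans_cost(projected, centers) + delta
```

For centers that lie inside the projection subspace, the two costs agree up to floating-point error, because each point's discarded component is orthogonal to the subspace. For arbitrary centers the agreement is only approximate; bounding that error in terms of m is the subject of the paper's analysis and is not reproduced here.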
| Original language | English |
|---|---|
| Pages (from-to) | 601-657 |
| Number of pages | 57 |
| Journal | SIAM Journal on Computing |
| Volume | 49 |
| Issue number | 3 |
| DOIs | |
| State | Published - 2020 |
Bibliographical note
Funding Information: The third author acknowledges the support of Collaborative Research Center 876, Project A2, funded by the German Science Foundation.
Publisher Copyright:
© 2020 Society for Industrial and Applied Mathematics.
Keywords
- Big data
- Coresets
- K-means
- PCA
- Projective clustering
- Streaming
ASJC Scopus subject areas
- General Computer Science
- General Mathematics
Related research output
Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering
Feldman, D., Schmidt, M. & Sohler, C., 2013, Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013. Association for Computing Machinery, p. 1434-1453, 20 p. (Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms).
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review