Abstract
Let P be a set of n points in R^d, k ≥ 1 be an integer and ɛ ∈ (0, 1) be a constant. An ɛ-coreset is a subset C ⊆ P with appropriate non-negative weights (scalars) that approximates P with respect to any given set Q ⊆ R^d of k centers. That is, the sum of squared distances from every point in P to its closest point in Q is the same, up to a factor of 1 ± ɛ, as the weighted sum of squared distances from C to the same k centers. If the coreset is small, we can solve problems such as k-means clustering or its variants (e.g., discrete k-means, where the centers are restricted to be in P, or in other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coresets support streaming, dynamic and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing such coresets are randomized and their size has at least linear dependency on the dimension d. We suggest the first such coreset of size independent of d. This is also the first deterministic coreset construction whose resulting size is not exponential in d. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon's cloud.
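To make the 1 ± ɛ guarantee concrete, the following minimal Python sketch evaluates the k-means cost of a point set and of a weighted subset for one fixed set of centers Q. It is an illustration only and not the construction from the paper: the function names `kmeans_cost` and `satisfies_coreset_bound` and the toy data are assumptions introduced here, and the check covers a single Q, whereas a true ɛ-coreset must satisfy the bound for every set of k centers.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point],
                weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of squared distances from each point to its closest center."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted input: every point counts once
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def satisfies_coreset_bound(P, C, w, Q, eps: float) -> bool:
    """Check (1 - eps) * cost(P, Q) <= cost_w(C, Q) <= (1 + eps) * cost(P, Q).

    This tests ONE query Q only; it does not prove that (C, w) is a coreset.
    """
    full = kmeans_cost(P, Q)
    approx = kmeans_cost(C, Q, w)
    return (1 - eps) * full <= approx <= (1 + eps) * full


# Toy data (hypothetical): P contains duplicates, so the distinct points
# weighted by their multiplicities reproduce the cost exactly for any Q.
P = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (4.0, 3.0), (4.0, 3.0)]
C = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0)]  # subset of P
w = [2.0, 1.0, 3.0]                       # multiplicities of each point in P
Q = [(1.0, 0.0), (4.0, 2.0)]              # k = 2 candidate centers

print(kmeans_cost(P, Q), kmeans_cost(C, Q, w))
print(satisfies_coreset_bound(P, C, w, Q, eps=0.1))
```

Because C is a subset of P, any coordinates that are zero in the input stay zero in the coreset, which is the sparsity-preserving property the abstract refers to. Replacing `w` with unit weights makes the weighted cost too small for this Q, and the check fails.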
| Original language | English |
|---|---|
| Article number | 92 |
| Journal | Algorithms |
| Volume | 13 |
| Issue number | 4 |
| State | Published - 1 Apr 2020 |
Bibliographical note
Funding Information: This research was funded by BSF/NSF Grant Number: 2014627 and by GIF 2408-407.6 Young Scientists’ Program Contract No.: I-1186-407.9-2014.
Publisher Copyright:
© 2020 by the authors.
Keywords
- Big data
- Clustering
- Coreset
- KMeans
- Streaming
ASJC Scopus subject areas
- Theoretical Computer Science
- Numerical Analysis
- Computational Theory and Mathematics
- Computational Mathematics
Fingerprint
Dive into the research topics of 'Deterministic coresets for k-Means of big sparse data'. Together they form a unique fingerprint.
Related research output
- 1 Conference contribution
- Coresets for visual summarization with applications to loop closure
  Volkov, M., Rosman, G., Feldman, D., Fisher, J. W. & Rus, D., 29 Jun 2015, 2015 IEEE International Conference on Robotics and Automation, ICRA 2015. June ed. Institute of Electrical and Electronics Engineers Inc., p. 3638-3645 8 p. 7139704. (Proceedings - IEEE International Conference on Robotics and Automation; vol. 2015-June, no. June).
  Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review