Abstract
How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dκ³/ε²) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones.
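The idea of a weighted subset standing in for the full data set can be illustrated with a small sketch. This is not the construction from the paper: it substitutes plain uniform sampling with weights n/m, which carries no (1 + ε) guarantee, and it works on synthetic one-dimensional data. All function names and parameter values here are hypothetical and chosen only for illustration.

```python
import math
import random


def gmm_log_density(x, mix_weights, means, stds):
    """Log density of a 1-D Gaussian mixture evaluated at x."""
    total = 0.0
    for pi, mu, sd in zip(mix_weights, means, stds):
        z = (x - mu) / sd
        total += pi * math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return math.log(total)


def weighted_log_likelihood(points, point_weights, model):
    """Sum of per-point log densities, each scaled by its weight."""
    return sum(w * gmm_log_density(x, *model)
               for x, w in zip(points, point_weights))


def uniform_weighted_subset(points, m, rng):
    """Stand-in for a coreset: m uniform samples, each weighted n/m.

    Assumption: uniform sampling replaces the paper's construction,
    so no approximation guarantee is implied.
    """
    n = len(points)
    subset = rng.sample(points, m)
    return subset, [n / m] * m


rng = random.Random(0)

# Synthetic data from a two-component mixture (illustrative values).
data = [rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(6.0, 1.0)
        for _ in range(20000)]

# A candidate model: (mixing weights, means, standard deviations).
model = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])

subset, subset_weights = uniform_weighted_subset(data, 500, rng)

full_ll = weighted_log_likelihood(data, [1.0] * len(data), model)
approx_ll = weighted_log_likelihood(subset, subset_weights, model)
relative_error = abs(full_ll - approx_ll) / abs(full_ll)

print(f"full data log-likelihood:   {full_ll:.1f}")
print(f"weighted subset estimate:   {approx_ll:.1f}")
print(f"relative error:             {relative_error:.4f}")
```

The weights sum to n, so the weighted log-likelihood on the subset is an estimate of the log-likelihood on all n points. A coreset in the sense of the abstract strengthens this to a guarantee that holds for every candidate model, not just the one evaluated here.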
| Original language | English |
|---|---|
| Title of host publication | Advances in Neural Information Processing Systems 24 |
| Subtitle of host publication | 25th Annual Conference on Neural Information Processing Systems 2011, NIPS 2011 |
| Publisher | Neural Information Processing Systems |
| ISBN (Print) | 9781618395993 |
| State | Published - 2011 |
| Externally published | Yes |
| Event | 25th Annual Conference on Neural Information Processing Systems 2011, NIPS 2011 - Granada, Spain. Duration: 12 Dec 2011 → 14 Dec 2011 |
Publication series
| Name | Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, NIPS 2011 |
|---|---|
Conference
| Conference | 25th Annual Conference on Neural Information Processing Systems 2011, NIPS 2011 |
|---|---|
| Country/Territory | Spain |
| City | Granada |
| Period | 12/12/11 → 14/12/11 |
ASJC Scopus subject areas
- Information Systems
Fingerprint
Dive into the research topics of 'Scalable training of mixture models via coresets'. Together they form a unique fingerprint.
Related research output
Training Gaussian mixture models at scale via coresets
Lucic, M., Faulkner, M., Krause, A. & Feldman, D., 1 May 2018, In: Journal of Machine Learning Research. 18, p. 1-25 (25 p.)
Research output: Contribution to journal › Article › peer-review