Abstract
Consider a stream of dd-dimensional rows (points in RdRd) arriving sequentially. An ϵϵ-coreset is a positively weighted subset that approximates their sum of squared distances to any linear subspace of RdRd, up to a 1 pm ϵ1±ϵ factor. Unlike other data summarizations, such a coreset: (1) can be used to minimize faster any optimization function that uses this sum, such as regularized or constrained regression, (2) preserves input sparsity; (3) easily interpretable; (4) avoids numerical errors; (5) applies to problems with constraints on the input, such as subspaces that are spanned by few input points. Our main result is the first algorithm that returns such an ϵϵ-coreset using finite and constant memory during the streaming, i.e., independent of nn, the number of rows seen so far. The coreset consists of O(d °2d / ϵ 2)O(dlog2d/ϵ2) weighted rows, which is nearly optimal according to existing lower bounds of Ω (d / ϵ 2)Ω(d/ϵ2). We support our findings with experiments on the Wikipedia dataset benchmarked against state-of-the-art algorithms.
| Original language | English |
|---|---|
| Pages (from-to) | 8699-8712 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Knowledge and Data Engineering |
| Volume | 35 |
| Issue number | 9 |
| DOIs | |
| State | Published - 1 Sep 2023 |
Bibliographical note
Publisher Copyright:© 1989-2012 IEEE.
Keywords
- Big data
- coresets
- optimization
- streaming algorithms
ASJC Scopus subject areas
- Information Systems
- Computer Science Applications
- Computational Theory and Mathematics