# Least-Mean-Squares Coresets for Infinite Streams

Vladimir Braverman, Dan Feldman, Harry Lang, Daniela Rus, Adiel Statman

Research output: Contribution to journal › Article › peer-review

## Abstract

Consider a stream of $d$-dimensional rows (points in $\mathbb{R}^{d}$) arriving sequentially. An $\epsilon$-coreset is a positively weighted subset of these rows that approximates their sum of squared distances to any linear subspace of $\mathbb{R}^{d}$, up to a $1 \pm \epsilon$ factor. Unlike other data summarizations, such a coreset: (1) can be used to speed up the minimization of any optimization function that uses this sum, such as regularized or constrained regression; (2) preserves input sparsity; (3) is easily interpretable; (4) avoids numerical errors; and (5) applies to problems with constraints on the input, such as subspaces that are spanned by few input points. Our main result is the first algorithm that returns such an $\epsilon$-coreset using finite and constant memory during streaming, i.e., memory independent of $n$, the number of rows seen so far. The coreset consists of $O(d \log^{2} d / \epsilon^{2})$ weighted rows, which is nearly optimal according to existing lower bounds of $\Omega(d / \epsilon^{2})$. We support our findings with experiments on the Wikipedia dataset, benchmarked against state-of-the-art algorithms.
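The coreset guarantee stated in the abstract can be checked numerically for any one subspace. The following is a minimal sketch of that check only, not the streaming algorithm of the paper; all function names (`orthonormalize`, `sq_dist_to_subspace`, `weighted_cost`, `relative_error`) and the toy data are hypothetical and chosen for illustration. It uses a dataset in which every row appears twice, so that keeping one copy of each row with weight 2 is an exact coreset (error 0), while the same subset with weight 1 underestimates the cost by half.

```python
import math
import random

def orthonormalize(vectors):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1e-12:  # skip (numerically) dependent vectors
            basis.append([wi / norm for wi in w])
    return basis

def sq_dist_to_subspace(p, basis):
    """Squared distance from p to the span of an orthonormal basis."""
    proj = sum(sum(pi * bi for pi, bi in zip(p, b)) ** 2 for b in basis)
    return max(0.0, sum(pi * pi for pi in p) - proj)

def weighted_cost(points, weights, basis):
    """Weighted sum of squared distances to the subspace."""
    return sum(w * sq_dist_to_subspace(p, basis)
               for p, w in zip(points, weights))

def relative_error(points, subset, weights, basis):
    """|cost(subset, weights) - cost(points)| / cost(points)."""
    full = weighted_cost(points, [1.0] * len(points), basis)
    approx = weighted_cost(subset, weights, basis)
    return abs(approx - full) / full if full > 0 else abs(approx)

# Toy data (an assumption for illustration): 5 rows in R^3, each seen twice.
rng = random.Random(0)
base = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)]
points = base + base
subset = base                  # one copy of each distinct row
weights = [2.0] * len(subset)  # weight 2 compensates for the dropped copy

# A query subspace: the line spanned by one random direction.
basis = orthonormalize([[rng.gauss(0, 1) for _ in range(3)]])

err_weighted = relative_error(points, subset, weights, basis)
err_unweighted = relative_error(points, subset, [1.0] * len(subset), basis)
```

Here `err_weighted` is zero up to floating-point rounding, and `err_unweighted` is 0.5. An $\epsilon$-coreset must keep this relative error at most $\epsilon$ for every linear subspace simultaneously, not only for the single subspace tested in this sketch.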

• Original language: English
• Pages (from-to): 1-18
• Number of pages: 18
• Journal: IEEE Transactions on Knowledge and Data Engineering
• DOI: https://doi.org/10.1109/TKDE.2022.3180808
• Publication status: Accepted/In press - 2022

IEEE

## Keywords

• Big Data
• Computational modeling
• Coresets
• Covariance matrices
• Libraries
• Memory management
• Optimization
• Random access memory
• Sparse matrices
• Streaming Algorithms

## ASJC Scopus subject areas

• Information Systems
• Computer Science Applications
• Computational Theory and Mathematics
