Coresets for the average case error for finite query sets

Alaa Maalouf, Ibrahim Jubran, Murad Tukan, Dan Feldman

Research output: Contribution to journal › Article › peer-review

Abstract

A coreset is usually a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, hypotheses). That is, the maximum (worst-case) error over all queries is bounded. To obtain smaller coresets, we suggest a natural relaxation: coresets whose average error over the given set of queries is bounded. We provide both deterministic and randomized (generic) algorithms for computing such a coreset for any finite set of queries. Unlike most corresponding coresets for the worst-case error, the size of the coreset in this work is independent of both the input size and its Vapnik–Chervonenkis (VC) dimension. The main technique is to reduce the average-case coreset problem to the vector summarization problem, where the goal is to compute a weighted subset of the n input vectors which approximates their sum. We then suggest the first algorithm for computing this weighted subset in time that is linear in the input size, for n ≫ 1/ε, where ε is the approximation error, improving, e.g., both [ICML’17] and applications for principal component analysis (PCA) [NIPS’16]. Experimental results show significant and consistent improvement also in practice. Open source code is provided.
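
To make the vector summarization problem concrete, the following is a minimal sketch that uses the classical Frank-Wolfe method to pick a small weighted subset of vectors whose weighted mean approximates the mean of all n inputs. It is an illustration swapped in for this page, not the authors' algorithm, and it does not reproduce the running time claimed in the abstract. The function name `frank_wolfe_mean_coreset` and the parameter `k` (maximum subset size) are hypothetical.

```python
import numpy as np

def frank_wolfe_mean_coreset(P, k):
    """Illustrative Frank-Wolfe sketch (not the paper's algorithm).

    P : (n, d) array of input vectors.
    k : maximum number of vectors allowed to receive nonzero weight.
    Returns nonnegative weights w summing to 1, with at most k nonzeros,
    such that P.T @ w approximates P.mean(axis=0).
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    mean = P.mean(axis=0)

    # Start from a single input vector.
    w = np.zeros(n)
    w[0] = 1.0
    c = P[0].copy()  # current weighted combination P.T @ w

    for _ in range(k - 1):
        residual = c - mean
        # Frank-Wolfe step: the vertex minimizing the linearized objective.
        i = int(np.argmin(P @ residual))
        d = P[i] - c
        denom = d @ d
        if denom == 0.0:
            break
        # Exact line search for ||c + gamma * d - mean||^2, kept in [0, 1].
        gamma = float(np.clip(-(residual @ d) / denom, 0.0, 1.0))
        w *= 1.0 - gamma
        w[i] += gamma
        c = c + gamma * d
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.normal(size=(1000, 5))
    w = frank_wolfe_mean_coreset(P, k=50)
    err = np.linalg.norm(P.T @ w - P.mean(axis=0))
    print("nonzero weights:", np.count_nonzero(w), "error:", err)
```

Because the sum of the inputs equals n times their mean, scaling the returned weights by n gives a weighted subset that approximates the sum. Each iteration costs one pass over the input, so this sketch runs in O(ndk) time for n vectors of dimension d.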

Original language: English
Article number: 6689
Journal: Sensors
Volume: 21
Issue number: 19
State: Published - 8 Oct 2021

Bibliographical note

Publisher Copyright:
© 2021 by the authors. Licensee MDPI, Basel, Switzerland.

Keywords

  • Approximation algorithms
  • Average case analysis
  • Big data
  • Coreset
  • Dimensionality reduction
  • Sparsification

ASJC Scopus subject areas

  • Analytical Chemistry
  • Information Systems
  • Instrumentation
  • Atomic and Molecular Physics, and Optics
  • Electrical and Electronic Engineering
  • Biochemistry
