On coresets for support vector machines

Murad Tukan, Cenk Baykal, Dan Feldman, Daniela Rus

Research output: Contribution to journal › Article › peer-review

Abstract

We present an efficient coreset construction algorithm for large-scale Support Vector Machine (SVM) training in Big Data and streaming applications. A coreset is a small, representative subset of the original data points such that a model trained on the coreset is provably competitive with one trained on the original data set. Since the coreset is generally much smaller than the original set, our preprocess-then-train scheme has the potential to yield significant speedups when training SVM models. We prove lower and upper bounds on the size of the coreset required to obtain small data summaries for the SVM problem. As a corollary, we show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings. We evaluate the performance of our algorithm on real-world and synthetic data sets. Our experimental results reaffirm the favorable theoretical properties of our algorithm and demonstrate its practical effectiveness in accelerating SVM training.
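As a rough illustration of the "small weighted subset" idea in the abstract, the sketch below compares a soft-margin SVM objective evaluated on a full synthetic data set with the same objective evaluated on a weighted subsample. It is not the paper's construction: uniform sampling is used here only as a simple stand-in, the synthetic data, the fixed model (w, b), the regularization constant lam, and all function names are assumptions made for this example, and a genuine coreset guarantee is usually stated for all candidate models at once rather than for a single one.

```python
import random

def svm_objective(points, weights, w, b, lam=1.0):
    """Weighted soft-margin SVM objective:
    0.5 * ||w||^2 + lam * sum_i u_i * max(0, 1 - y_i * (<w, x_i> + b))."""
    reg = 0.5 * sum(wj * wj for wj in w)
    hinge = 0.0
    for (x, y), u in zip(points, weights):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        hinge += u * max(0.0, 1.0 - margin)
    return reg + lam * hinge

def uniform_weighted_sample(points, m, rng):
    """Stand-in for a coreset: draw m points uniformly with replacement and
    give each the weight n/m, so the weighted hinge sum is unbiased."""
    n = len(points)
    sample = [points[rng.randrange(n)] for _ in range(m)]
    return sample, [n / m] * m

rng = random.Random(0)

# Hypothetical synthetic data: two Gaussian classes in the plane.
data = []
for _ in range(20000):
    y = rng.choice((-1, 1))
    data.append(((rng.gauss(y, 1.0), rng.gauss(y, 1.0)), y))

w, b = (0.5, 0.5), 0.0  # one fixed candidate model, chosen arbitrarily

full = svm_objective(data, [1.0] * len(data), w, b)
subset, u = uniform_weighted_sample(data, 2000, rng)
approx = svm_objective(subset, u, w, b)
rel_err = abs(full - approx) / full
print(f"full={full:.1f}  subset={approx:.1f}  relative error={rel_err:.3f}")
```

In a preprocess-then-train pipeline, the weighted subset would then be handed to an SVM solver that accepts per-sample weights in place of the full data set. Uniform sampling can miss rare but influential points, which is one reason constructions with provable bounds are of interest.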

Original language: English
Pages (from-to): 171-191
Number of pages: 21
Journal: Theoretical Computer Science
Volume: 890
State: Published - 12 Oct 2021

Bibliographical note

Funding Information:
This research was supported in part by the U.S. National Science Foundation (NSF) under Award 1723943, Office of Naval Research (ONR) Grant N00014-18-1-2830, Microsoft Corporation, and JP Morgan Chase.

Publisher Copyright:
© 2021 Elsevier B.V.

Keywords

  • Coresets
  • Data reduction
  • Large-scale learning
  • Streaming
  • Support vector machines

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
