Sets Clustering

Research output: Contribution to journal › Conference article › peer-review

Abstract

The input to the sets-k-means problem is an integer $k \geq 1$ and a set $\mathcal{P} = \{P_1, \ldots, P_n\}$ of fixed-size sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P \in \mathcal{P}} \min_{p \in P, c \in C} \|p - c\|^2$ of squared distances to these sets. An $\varepsilon$-core-set for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1 \pm \varepsilon$ factor, for every set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2 n)$ sets always exists, and can be computed in $O(n \log n)$ time, for every input $\mathcal{P}$ and every fixed $d, k \geq 1$ and $\varepsilon \in (0, 1)$. The result easily generalizes to any metric space, distances to the power of $z > 0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this core-set allows us to obtain the first PTAS ($(1 + \varepsilon)$-approximation) for the sets-k-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k = 1$, $d = 2$). Open source code and experimental results for document classification and facility locations are also provided.
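To make the objective concrete, the following is a minimal sketch of how the sets-k-means cost and the ε-core-set guarantee could be evaluated for one fixed set of centers. It is not the authors' released code: the function names (`sets_kmeans_cost`, `within_eps`), the optional `weights` argument, and the toy data are all assumptions introduced for illustration, and it does not construct a core-set.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def sets_kmeans_cost(
    sets: Sequence[Sequence[Point]],
    centers: Sequence[Point],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum over input sets of the smallest squared distance
    between any point of the set and any center.

    With weights=None every set has weight 1, which gives the
    unweighted objective stated in the abstract.
    """
    if weights is None:
        weights = [1.0] * len(sets)
    return sum(
        w * min(sq_dist(p, c) for p in point_set for c in centers)
        for w, point_set in zip(weights, sets)
    )


def within_eps(full_cost: float, approx_cost: float, eps: float) -> bool:
    """Check the (1 +/- eps) guarantee for one fixed set of centers.
    A true core-set must satisfy this for every set of k centers."""
    return abs(approx_cost - full_cost) <= eps * full_cost


# Toy input (hypothetical): n = 3 sets of 2 points each in the plane, k = 2.
sets = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(1.0, 0.0), (5.0, 5.0)],
    [(9.0, 10.0), (-3.0, 4.0)],
]
centers = [(0.0, 0.0), (10.0, 10.0)]

cost = sets_kmeans_cost(sets, centers)
weighted_cost = sets_kmeans_cost(sets, centers, weights=[1.0, 3.0, 0.5])
```

Each set contributes only its single closest point-center pair, so a set is served well as soon as any one of its points lies near a center. Here the three sets contribute 0, 1, and 1, for a total cost of 2.0.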

Original language: English
Journal: Proceedings of Machine Learning Research
Volume: 119
State: Published - 2020
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: 13 Jul 2020 to 18 Jul 2020

Bibliographical note

Publisher Copyright:
© 2020 by the author(s).

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
