Sets Clustering

Ibrahim Jubran, Murad Tukan, Alaa Maalouf, Dan Feldman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The input to the sets-$k$-means problem is an integer $k \geq 1$ and a set $\mathcal{P} = \{P_1, \cdots, P_n\}$ of fixed-size sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P \in \mathcal{P}} \min_{p \in P, c \in C} \|p - c\|^2$ of squared distances to these sets. An $\varepsilon$-core-set for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1 \pm \varepsilon$ factor, for every set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2 n)$ sets always exists, and can be computed in $O(n \log n)$ time, for every input $\mathcal{P}$ and every fixed $d, k \geq 1$ and $\varepsilon \in (0, 1)$. The result easily generalizes to any metric space, distances to the power of $z > 0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this core-set allows us to obtain the first PTAS ($(1 + \varepsilon)$-approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k = 1$, $d = 2$). Open source code and experimental results for document classification and facility locations are also provided.
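The objective in the abstract can be made concrete with a small sketch. The Python below is an illustration only, not the authors' released code: the function names (`squared_distance`, `set_to_centers_cost`, `sets_k_means_cost`) and the optional `weights` argument are hypothetical, and the weighted sum is an assumption about how the cost of a weighted subset (such as a core-set) would be evaluated. It computes the cost by brute force and makes no attempt at the core-set construction itself.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def squared_distance(p: Point, c: Point) -> float:
    """Squared Euclidean distance ||p - c||^2."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def set_to_centers_cost(point_set: Sequence[Point], centers: Sequence[Point]) -> float:
    """min over p in P and c in C of ||p - c||^2 for a single input set P."""
    return min(squared_distance(p, c) for p in point_set for c in centers)


def sets_k_means_cost(
    sets: Sequence[Sequence[Point]],
    centers: Sequence[Point],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum over all input sets of the closest point-to-center squared distance.

    `weights` is a hypothetical extension (assumption): one non-negative
    weight per set, as would be needed to evaluate a weighted subset.
    """
    if weights is None:
        weights = [1.0] * len(sets)
    return sum(w * set_to_centers_cost(P, centers) for w, P in zip(weights, sets))


# Three sets of two points each in the plane, and k = 2 candidate centers.
sets = [
    [(0, 0), (10, 10)],
    [(1, 0), (5, 5)],
    [(9, 9), (20, 20)],
]
centers = [(0, 0), (10, 10)]

total = sets_k_means_cost(sets, centers)
weighted_total = sets_k_means_cost(sets, centers, weights=[2.0, 0.0, 1.0])
```

Each set contributes only through its single closest point-center pair, which is what distinguishes this objective from ordinary $k$-means over the union of all points.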

Original language: English
Title of host publication: 37th International Conference on Machine Learning, ICML 2020
Editors: Hal Daume, Aarti Singh
Publisher: International Machine Learning Society (IMLS)
Pages: 4961-4972
Number of pages: 12
ISBN (Electronic): 9781713821120
State: Published - 2020
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: 13 Jul 2020 to 18 Jul 2020

Publication series

Name: 37th International Conference on Machine Learning, ICML 2020
Volume: PartF168147-7

Conference

Conference: 37th International Conference on Machine Learning, ICML 2020
City: Virtual, Online
Period: 13/07/20 to 18/07/20

Bibliographical note

Publisher Copyright:
© 2020 by the Authors.

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Human-Computer Interaction
  • Software
