## Abstract

The input to the sets-$k$-means problem is an integer $k \geq 1$ and a set $\mathcal{P} = \{P_1, \ldots, P_n\}$ of fixed-size sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P \in \mathcal{P}} \min_{p \in P,\, c \in C} \|p - c\|^2$ of squared distances to these sets. An $\varepsilon$-core-set for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1 \pm \varepsilon$ factor, for every set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2 n)$ sets always exists, and can be computed in $O(n \log n)$ time, for every input $\mathcal{P}$ and every fixed $d, k \geq 1$ and $\varepsilon \in (0, 1)$. The result generalizes easily to any metric space, to distances raised to the power of $z > 0$, and to M-estimators that handle outliers. Applying an inefficient but optimal algorithm to this core-set allows us to obtain the first PTAS ($(1 + \varepsilon)$-approximation) for the sets-$k$-means problem that runs in time near-linear in $n$. This is the first such result even for sets-mean on the plane ($k = 1$, $d = 2$). Open source code and experimental results for document classification and facility location are also provided.
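To make the objective in the abstract concrete, the following is a minimal Python sketch of the sets-$k$-means cost: for each input set, take the smallest squared distance over all (point, center) pairs, then sum (optionally with weights, as a core-set would supply). The names `sets_kmeans_cost` and `within_eps`, and the toy data, are illustrative assumptions and are not taken from the authors' open source code. The sketch only evaluates the cost; it does not construct a core-set.

```python
from math import inf


def sets_kmeans_cost(sets, centers, weights=None):
    """Weighted sum, over the input sets, of the squared Euclidean distance
    between the closest (point, center) pair of each set.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if weights is None:
        weights = [1.0] * len(sets)
    total = 0.0
    for point_set, w in zip(sets, weights):
        best = inf
        for p in point_set:
            for c in centers:
                d2 = sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                best = min(best, d2)
        total += w * best
    return total


def within_eps(full_cost, approx_cost, eps):
    """Check the (1 +/- eps) condition for ONE candidate set of centers.

    A true eps-core-set must satisfy this for every set of k centers,
    which this single check does not establish.
    """
    return abs(full_cost - approx_cost) <= eps * full_cost


# Toy input (assumed data): n = 3 sets, each with 2 points in the plane.
P = [
    [(0.0, 0.0), (4.0, 0.0)],
    [(1.0, 1.0), (9.0, 9.0)],
    [(5.0, 5.0), (0.0, 2.0)],
]
C = [(0.0, 0.0), (5.0, 4.0)]  # k = 2 candidate centers

cost = sets_kmeans_cost(P, C)
```

Passing a `weights` list together with a subset of the input sets evaluates the same sum on a weighted subset, which is the quantity a core-set is required to keep within a $1 \pm \varepsilon$ factor of the full cost.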

Original language | English |
---|---|
Title of host publication | 37th International Conference on Machine Learning, ICML 2020 |
Editors | Hal Daume, Aarti Singh |
Publisher | International Machine Learning Society (IMLS) |
Pages | 4961-4972 |
Number of pages | 12 |
ISBN (Electronic) | 9781713821120 |
State | Published - 2020 |
Event | 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online; Duration: 13 Jul 2020 → 18 Jul 2020 |

### Publication series

Name | 37th International Conference on Machine Learning, ICML 2020 |
---|---|
Volume | PartF168147-7 |

### Conference

Conference | 37th International Conference on Machine Learning, ICML 2020 |
---|---|
City | Virtual, Online |
Period | 13/07/20 → 18/07/20 |

### Bibliographical note

Publisher Copyright: © 2020 by the Authors.

## ASJC Scopus subject areas

- Computational Theory and Mathematics
- Human-Computer Interaction
- Software