## Abstract

We provide the first streaming algorithm for computing a provable approximation to the κ-means of sparse Big Data. Here, sparse Big Data is a stream of n vectors in ℝ^d, where each vector has O(1) non-zero entries and possibly d ≥ n. Examples include the adjacency matrix of a graph and web-link, social-network, document-term, or image-feature matrices. Our streaming algorithm stores at most log n · κ^{O(1)} input points in memory. If the stream is distributed among M machines, the running time reduces by a factor of M, while communicating a total of M · κ^{O(1)} (sparse) input points between the machines. Our main contribution is a deterministic algorithm for computing a sparse (κ,ϵ)-coreset, which is a weighted subset of κ^{O(1)} input points that approximates the sum of squared distances from the n input points to every set of κ centers, up to a (1 ± ϵ) factor, for any given constant ϵ > 0. This is the first such coreset of size independent of both d and n. Our experimental results show how our algorithm can be used to boost the performance of any given κ-means heuristic, even in the off-line setting. Open access to our implementation is also provided.
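To make the coreset guarantee concrete, the following is a minimal Python sketch of the quantity a (κ,ϵ)-coreset must preserve: the weighted sum of squared distances to the nearest of κ centers. It is not the paper's algorithm. The function names (`sq_dist`, `kmeans_cost`, `approximates`, `merge_duplicates`) and the dictionary representation of sparse vectors are assumptions made for illustration, and `merge_duplicates` is only a trivial stand-in that collapses identical points, which happens to preserve the cost exactly. The check covers one candidate set of centers, whereas the definition requires the bound for every set of κ centers.

```python
from collections import Counter


def sq_dist(p, c):
    """Squared Euclidean distance between two sparse vectors (dict: index -> value)."""
    keys = set(p) | set(c)
    return sum((p.get(i, 0.0) - c.get(i, 0.0)) ** 2 for i in keys)


def kmeans_cost(points, weights, centers):
    """Weighted sum of squared distances from each point to its nearest center."""
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def approximates(points, coreset, coreset_weights, centers, eps):
    """Check the (1 +/- eps) bound for ONE candidate set of centers only."""
    full = kmeans_cost(points, [1.0] * len(points), centers)
    small = kmeans_cost(coreset, coreset_weights, centers)
    return abs(full - small) <= eps * full


def merge_duplicates(points):
    """Hypothetical stand-in for a coreset construction: collapse identical
    points into one weighted point. This is NOT the paper's method."""
    counts = Counter(tuple(sorted(p.items())) for p in points)
    coreset = [dict(key) for key in counts]
    weights = [float(n) for n in counts.values()]
    return coreset, weights


# Six sparse points in a high-dimensional space, each with one non-zero entry.
points = [{0: 1.0}, {0: 1.0}, {5: 2.0}, {5: 2.0}, {5: 2.0}, {9: 4.0}]
coreset, weights = merge_duplicates(points)
centers = [{0: 1.0}, {5: 1.0}]  # kappa = 2 candidate centers

full_cost = kmeans_cost(points, [1.0] * len(points), centers)
coreset_cost = kmeans_cost(coreset, weights, centers)
print(full_cost, coreset_cost, len(coreset))
```

Storing each vector as a dictionary keeps the cost of a distance computation proportional to the number of non-zero entries rather than to d, which matters in the regime d ≥ n that the abstract describes.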

Original language | English |
---|---|
Title of host publication | 16th SIAM International Conference on Data Mining 2016, SDM 2016 |
Editors | Sanjay Chawla Venkatasubramanian, Wagner Meira |
Publisher | Society for Industrial and Applied Mathematics Publications |
Pages | 342-350 |
Number of pages | 9 |
ISBN (Electronic) | 9781510828117 |
State | Published - 2016 |
Event | 16th SIAM International Conference on Data Mining 2016, SDM 2016 - Miami, United States. Duration: 5 May 2016 → 7 May 2016 |

### Publication series

Name | 16th SIAM International Conference on Data Mining 2016, SDM 2016 |
---|---|

### Conference

Conference | 16th SIAM International Conference on Data Mining 2016, SDM 2016 |
---|---|
Country/Territory | United States |
City | Miami |
Period | 5/05/16 → 7/05/16 |

### Bibliographical note

Funding Information: Support for this work has been provided in part by BSF/NSF Grant Number: 2014627 and by GIF 2408-407.6 Young Scientists' Program Contract No.: I-1186-407.9-2014.

Publisher Copyright: Copyright © by SIAM.

## Keywords

- Big-Data
- Clustering
- Coresets
- Distributed
- Streaming
- κ-Means

## ASJC Scopus subject areas

- Computer Science Applications
- Software