κ-means for streaming and distributed big sparse data

Artem Barger, Dan Feldman

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

We provide the first streaming algorithm for computing a provable approximation to the κ-means of sparse Big Data. Here, sparse Big Data is a stream of n vectors in ℝ^d, where each vector has O(1) non-zero entries and possibly d ≥ n. Examples include the adjacency matrix of a graph and web-link, social-network, document-term, or image-feature matrices. Our streaming algorithm stores at most κ^{O(1)} log n input points in memory. If the stream is distributed among M machines, the running time is reduced by a factor of M, while a total of M κ^{O(1)} (sparse) input points are communicated between the machines. Our main contribution is a deterministic algorithm for computing a sparse (κ,ϵ)-coreset, which is a weighted subset of κ^{O(1)} input points that approximates the sum of squared distances from the n input points to every set of κ centers, up to a (1 ± ϵ) factor, for any given constant ϵ > 0. This is the first such coreset of size independent of both d and n. Our experimental results show how our algorithm can be used to boost the performance of any given κ-means heuristic, even in the off-line setting. Open access to our implementation is also provided.
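The (κ,ϵ)-coreset guarantee stated in the abstract can be illustrated with a minimal sketch. This is not the authors' construction: it only evaluates the weighted κ-means cost and checks the (1 ± ϵ) bound for a single candidate set of centers, whereas a true coreset must satisfy it for every set of κ centers. The function names (`sq_dist`, `kmeans_cost`, `within_eps`) and the toy data are assumptions made for illustration.

```python
from typing import Sequence

Point = Sequence[float]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance between two dense vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans_cost(points, weights, centers) -> float:
    """Weighted sum of squared distances from each point to its nearest center."""
    return sum(
        w * min(sq_dist(p, c) for c in centers)
        for p, w in zip(points, weights)
    )


def within_eps(points, coreset, coreset_weights, centers, eps) -> bool:
    """Check the (1 +/- eps) bound for ONE candidate set of centers only."""
    full = kmeans_cost(points, [1.0] * len(points), centers)
    approx = kmeans_cost(coreset, coreset_weights, centers)
    return (1 - eps) * full <= approx <= (1 + eps) * full


# Hypothetical toy data: every point appears twice, so keeping one copy
# of each with weight 2 reproduces the cost exactly for any centers.
points = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0), (4.0, 0.0)]
coreset = [(0.0, 0.0), (4.0, 0.0)]
coreset_weights = [2.0, 2.0]
centers = [(1.0, 0.0), (3.0, 0.0)]

ok = within_eps(points, coreset, coreset_weights, centers, eps=0.1)
```

In this toy case the weighted subset matches the full cost exactly; in general the interest of a coreset is that the bound holds simultaneously for all choices of κ centers while the subset stays small.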

Original language: English
Title of host publication: 16th SIAM International Conference on Data Mining 2016, SDM 2016
Editors: Sanjay Chawla Venkatasubramanian, Wagner Meira
Publisher: Society for Industrial and Applied Mathematics Publications
Pages: 342-350
Number of pages: 9
ISBN (Electronic): 9781510828117
State: Published - 2016
Event: 16th SIAM International Conference on Data Mining 2016, SDM 2016 - Miami, United States
Duration: 5 May 2016 to 7 May 2016

Publication series

Name: 16th SIAM International Conference on Data Mining 2016, SDM 2016

Conference

Conference: 16th SIAM International Conference on Data Mining 2016, SDM 2016
Country/Territory: United States
City: Miami
Period: 5/05/16 to 7/05/16

Bibliographical note

Funding Information:
Support for this work has been provided in part by BSF/NSF Grant Number: 2014627 and by GIF 2408-407.6 Young Scientists' Program Contract No.: I-1186-407.9-2014.

Publisher Copyright:
Copyright © by SIAM.

Keywords

  • Big-Data
  • Clustering
  • Coresets
  • Distributed
  • Streaming
  • κ-Means

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
