Dimensionality reduction of massive sparse datasets using coresets

Dan Feldman, Mikhail Volkov, Daniela Rus

Research output: Contribution to journal › Conference article › peer-review

Abstract

In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the Principal Component Analysis (PCA) of any n × d matrix, using one pass over the stream of its rows. Our solution uses coresets: a scaled subset of the n rows that approximates their sum of squared distances to every k-dimensional affine subspace. An open theoretical problem has been to compute such a coreset that is independent of both n and d. An open practical problem has been to compute a non-trivial approximation to the PCA of very large but sparse databases such as the Wikipedia document-term matrix in a reasonable time. We answer both of these questions affirmatively. Our main technical result is a new framework for deterministic coreset constructions based on a reduction to the problem of counting items in a stream.
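As a minimal illustration of the quantity a coreset is meant to preserve, the sketch below computes the (weighted) sum of squared distances from a set of rows to one k-dimensional affine subspace, and compares the full data with a scaled subset. It is not the authors' construction: the function name `subspace_cost`, the toy data, and the choice of subset (one copy of each duplicated row, with weight 2) are assumptions made purely for illustration, and the basis vectors are assumed to be orthonormal.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(u, v))


def subspace_cost(
    rows: Sequence[Vector],
    point: Vector,
    basis: Sequence[Vector],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of squared distances from `rows` to the affine
    subspace {point + span(basis)}. `basis` must be orthonormal."""
    if weights is None:
        weights = [1.0] * len(rows)
    total = 0.0
    for row, w in zip(rows, weights):
        centered: List[float] = [x - p for x, p in zip(row, point)]
        # Pythagoras: squared distance to the subspace equals the squared
        # norm minus the squared length of the projection onto the basis.
        residual = dot(centered, centered) - sum(dot(centered, b) ** 2 for b in basis)
        total += w * max(residual, 0.0)  # clamp tiny negative rounding errors
    return total


# Toy data (hypothetical): four rows in 3 dimensions, with duplicates.
rows = [[1.0, 0.0, 2.0], [1.0, 0.0, 2.0], [0.0, 3.0, -1.0], [0.0, 3.0, -1.0]]

# Scaled subset: one copy of each distinct row, each with weight 2.
subset = [rows[0], rows[2]]
subset_weights = [2.0, 2.0]

# Query subspace with k = 1: the line through `point` along the x-axis.
point = [0.0, 1.0, 0.0]
basis = [[1.0, 0.0, 0.0]]

full_cost = subspace_cost(rows, point, basis)
subset_cost = subspace_cost(subset, point, basis, subset_weights)
print(full_cost, subset_cost)
```

In this toy case the scaled subset reproduces the cost exactly, for any query subspace, because it only merges duplicate rows. A coreset in the sense of the abstract is allowed an approximation error instead, but must hold for every k-dimensional affine subspace simultaneously.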

Original language: English
Pages (from-to): 2774-2782
Number of pages: 9
Journal: Advances in Neural Information Processing Systems
State: Published - 2016
Event: 30th Annual Conference on Neural Information Processing Systems, NIPS 2016 - Barcelona, Spain
Duration: 5 Dec 2016 → 10 Dec 2016

Bibliographical note

Funding Information:
Support for this research has been provided by Hon Hai/Foxconn Technology Group and NSFSaTC-BSF CNC 1526815, and in part by the Singapore MIT Alliance on Research and Technology through the Future of Urban Mobility project and by Toyota Research Institute (TRI). TRI provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We are grateful for this support.

Publisher Copyright:
© 2016 NIPS Foundation - All Rights Reserved.

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
