Abstract
Entropy is a fundamental property of data and a key metric in many scientific and engineering fields. Entropy estimation has been extensively studied, but almost always under the assumption that there is a single data stream, seen in its entirety by one node running the estimation algorithm. Multiple distributed data sources are becoming increasingly common, however, with applications in signal processing, computer science, medicine, physics, and more. Centralizing all data can be infeasible, for example in networks of battery- or bandwidth-limited sensors, so entropy estimation in distributed streams requires new, communication-efficient approaches. We propose a practical, communication-efficient algorithm for continuously approximating the entropy of distributed streams, with deterministic, user-defined error bounds. Unlike previous streaming methods, it supports deletions and variable-sized time-based sliding windows, while still avoiding communication when possible. Moreover, it optionally incorporates a state-of-the-art entropy sketch, allowing for both bandwidth reduction and monitoring of very high-dimensional problems. Finally, it provides the approximation to all nodes, rather than to a centralized location, which is important in settings such as wireless sensor networks. Evaluation on several public datasets from real application domains shows that our adaptive algorithm can often reduce the number of messages by two orders of magnitude compared to centralizing all data in one node.
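To make the monitored quantity concrete, the following is a minimal sketch of the naive baseline the abstract compares against: every node ships its local counts to one place, and the empirical Shannon entropy is computed over the merged frequency table. It is not the paper's algorithm, which is designed to avoid this communication. The names `Node`, `shannon_entropy`, and `global_entropy` are hypothetical, and the use of exact per-item counts (rather than a sketch) is an assumption made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Empirical Shannon entropy, in bits, of a frequency table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p), skipping items whose count has dropped to zero.
    return sum((c / total) * math.log2(total / c)
               for c in counts.values() if c > 0)


class Node:
    """Hypothetical stream node holding exact local item counts."""

    def __init__(self):
        self.counts = Counter()

    def insert(self, item):
        self.counts[item] += 1

    def delete(self, item):
        # Deletions model items leaving a sliding window.
        if self.counts[item] <= 0:
            raise ValueError(f"cannot delete absent item: {item!r}")
        self.counts[item] -= 1


def global_entropy(nodes):
    """Centralized baseline: merge all local counts, then compute entropy."""
    merged = Counter()
    for node in nodes:
        merged.update(node.counts)  # Counter.update adds counts
    return shannon_entropy(merged)
```

Note that the global entropy cannot be obtained by averaging local entropies: two nodes that each see a single distinct symbol have local entropy 0, while the merged stream has entropy 1 bit. This is one reason a naive per-node computation does not suffice.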
Original language | English |
---|---|
Title of host publication | KDD 2017 - Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |
Publisher | Association for Computing Machinery |
Pages | 837-846 |
Number of pages | 10 |
ISBN (Electronic) | 9781450348874 |
State | Published - 13 Aug 2017 |
Event | 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017 - Halifax, Canada; Duration: 13 Aug 2017 → 17 Aug 2017 |
Publication series
Name | Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |
---|---|
Volume | Part F129685 |
Conference
Conference | 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017 |
---|---|
Country/Territory | Canada |
City | Halifax |
Period | 13/08/17 → 17/08/17 |
Bibliographical note
Publisher Copyright: © 2017 Association for Computing Machinery.
Keywords
- Data mining
- Distributed streams
- Entropy estimation
ASJC Scopus subject areas
- Software
- Information Systems