Abstract
Big data means big datacenters, comprised of hundreds or thousands of machines. With so many machines, failures are commonplace. Failure detection is crucial: undetected failures may lead to data loss and outages. Recent health monitoring approaches use anomaly detection to forecast failures: anomalous machines are considered to be at risk of future failures. Our previous work focused on detecting latent faults in large web services, which are often characterized by scale-out architecture where load is dynamically balanced. We proposed a robust and unsupervised latent fault detector for such systems, with statistical bounds on the rate of false positives. That detector, however, is unsuitable for applications without dynamic load balancing, such as statically-balanced key-value stores, Hadoop jobs, and supercomputer applications. We describe an improved latent fault detection method for unbalanced workloads. It retains the advantages of our previous methods: it is unsupervised, robust to changes, and statistically sound. Moreover, the statistical bounds for the new method scale better with the number of machines, and so dramatically reduce the number of measurements needed. Preliminary evaluation on supercomputer logs shows that the new method is able to correctly predict some failures, while our previous methods completely fail in this setting.
Original language | English |
---|---|
Pages (from-to) | 118-124 |
Number of pages | 7 |
Journal | CEUR Workshop Proceedings |
Volume | 1330 |
State | Published - 2015 |
Event | Joint Workshops of the International Conference on Extending Database Technology and the International Conference on Database Theory, EDBT/ICDT-WS 2015 - Brussels, Belgium; Duration: 27 Mar 2015 → … |
Bibliographical note
Publisher Copyright: © 2015, Copyright is with the authors.
Keywords
- Anomaly detection
- Data mining
- Health monitoring
ASJC Scopus subject areas
- General Computer Science