Latent fault detection with unbalanced workloads

Moshe Gabel, Kento Sato, Daniel Keren, Satoshi Matsuoka, Assaf Schuster

Research output: Contribution to journal › Conference article › peer-review


Big data means big datacenters, comprised of hundreds or thousands of machines. With so many machines, failures are commonplace. Failure detection is crucial: undetected failures may lead to data loss and outages. Recent health monitoring approaches use anomaly detection to forecast failures: anomalous machines are considered to be at risk of future failures. Our previous work focused on detecting latent faults in large web services, which are often characterized by scale-out architecture where load is dynamically balanced. We proposed a robust and unsupervised latent fault detector for such systems, with statistical bounds on the rate of false positives. That detector, however, is unsuitable for applications without dynamic load balancing, such as statically-balanced key-value stores, Hadoop jobs, and supercomputer applications. We describe an improved latent fault detection method for unbalanced workloads. It retains the advantages of our previous methods: it is unsupervised, robust to changes, and statistically sound. Moreover, the statistical bounds for the new method scale better with the number of machines, and so dramatically reduce the number of measurements needed. Preliminary evaluation on supercomputer logs shows that the new method is able to correctly predict some failures, while our previous methods completely fail in this setting.

Original language: English
Pages (from-to): 118-124
Number of pages: 7
Journal: CEUR Workshop Proceedings
State: Published - 2015
Event: Joint Workshops of the International Conference on Extending Database Technology and the International Conference on Database Theory, EDBT/ICDT-WS 2015 - Brussels, Belgium
Duration: 27 Mar 2015 → …

Bibliographical note

Publisher Copyright:
© 2015, Copyright is with the authors.


  • Anomaly detection
  • Data mining
  • Health monitoring

ASJC Scopus subject areas

  • General Computer Science

