Abstract
Network hardware faults are inevitable in massive scale-out ML training clusters. Networks in such systems are inherently designed for resiliency, routing around faulty components as long as a fault is detected. Unfortunately, some silent faults evade detection. Notably, the effects of silent faults are amplified in modern production networks that deploy per-packet load balancing, because packets of a single flow traverse many network paths, making such faults particularly hard to localize.

We present FlowPulse, the first system for rapid, low-overhead detection of silent network faults in per-packet spraying networks. Our key insight is that distributed training workloads induce predictable traffic patterns at the switch ports, which we refer to as temporal symmetry. This symmetry emerges even in the presence of known faults, and can be modeled analytically or learned from the traffic. FlowPulse detects new network faults affecting training tasks by identifying subtle deviations from the expected temporal symmetry on each switch during collective communications, all without any inter-switch coordination or probing overheads. Our preliminary results show that FlowPulse is effective in detecting silent faults in a variety of realistic settings, topologies and fault patterns. For example, FlowPulse identifies a single faulty link with a 1.5% corruption rate by checking temporal symmetry in a full two-level fat tree topology with 32 leaf switches while performing Ring-AllReduce on all nodes.
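The per-switch check described in the abstract can be illustrated with a minimal sketch. This is not the FlowPulse algorithm, whose details the abstract does not give. It assumes a switch can read per-port byte counters once per collective iteration, that a baseline is learned as the per-port median over earlier iterations, and that a fixed relative threshold marks a deviation. The function names, port labels, counter values, and the 1% threshold are all hypothetical.

```python
from statistics import median


def learn_baseline(history):
    """Learn an expected per-port byte count from past iterations.

    `history` is a list of dicts mapping a port name to the bytes seen on
    that port during one collective iteration (hypothetical counter format).
    """
    ports = history[0].keys()
    return {port: median(sample[port] for sample in history) for port in ports}


def deviating_ports(baseline, sample, rel_threshold=0.01):
    """Return ports whose byte count deviates from the learned baseline.

    The relative threshold is an illustrative assumption, not a value
    taken from the paper.
    """
    flagged = []
    for port, expected in baseline.items():
        observed = sample.get(port, 0)
        if expected == 0:
            # No traffic was expected, so any traffic counts as a deviation.
            if observed != 0:
                flagged.append(port)
            continue
        if abs(observed - expected) / expected > rel_threshold:
            flagged.append(port)
    return sorted(flagged)


# Hypothetical counters from three healthy iterations on one switch.
history = [
    {"up0": 1_000_000, "up1": 1_000_100, "up2": 1_000_000, "up3": 999_900},
    {"up0": 1_000_200, "up1": 1_000_000, "up2": 1_000_050, "up3": 1_000_000},
    {"up0": 999_900, "up1": 999_950, "up2": 999_980, "up3": 1_000_100},
]
baseline = learn_baseline(history)

# A later iteration in which port up2 carries about 1.5% fewer bytes.
sample = {"up0": 1_000_100, "up1": 1_000_000, "up2": 985_000, "up3": 1_000_000}
flagged = deviating_ports(baseline, sample)
```

In this sketch each switch uses only its own counters, mirroring the abstract's point that detection needs no inter-switch coordination or probing. Treating a 1.5% corruption rate as a 1.5% drop in bytes on a port is a simplification made for the example.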
| Original language | English |
|---|---|
| Title of host publication | HotNets 2025 - Proceedings of the 2025 24th ACM Workshop on Hot Topics in Networks |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 139-148 |
| Number of pages | 10 |
| ISBN (Electronic) | 9798400722806 |
| DOIs | |
| State | Published - 17 Nov 2025 |
| Event | 24th ACM Workshop on Hot Topics in Networks, HotNets 2025 - College Park, United States. Duration: 17 Nov 2025 → 18 Nov 2025 |
Publication series
| Name | HotNets 2025 - Proceedings of the 2025 24th ACM Workshop on Hot Topics in Networks |
|---|---|
Conference
| Conference | 24th ACM Workshop on Hot Topics in Networks, HotNets 2025 |
|---|---|
| Country/Territory | United States |
| City | College Park |
| Period | 17/11/25 → 18/11/25 |
Bibliographical note
Publisher Copyright: © 2025 Copyright held by the owner/author(s).
Keywords
- measurement and telemetry
- network debugging
- networks for machine learning
ASJC Scopus subject areas
- Computer Networks and Communications
Fingerprint
Dive into the research topics of 'FlowPulse: Catching Network Failures in ML Clusters'. Together they form a unique fingerprint.