FlowPulse: Catching Network Failures in ML Clusters

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Network hardware faults are inevitable in massive scale-out ML training clusters. Networks in such systems are inherently designed for resiliency, routing around faulty components as long as a fault is detected. Unfortunately, some silent faults evade detection. Notably, the effects of silent faults are amplified in modern production networks that deploy per-packet load balancing, because packets of a single flow traverse many network paths, making such faults particularly hard to localize. We present FlowPulse, the first system for rapid, low-overhead detection of silent network faults in per-packet spraying networks. Our key insight is that distributed training workloads induce predictable traffic patterns at the switch ports, which we refer to as temporal symmetry. This symmetry emerges even in the presence of known faults, and can be modeled analytically or learned from the traffic. FlowPulse detects new network faults of training tasks by identifying subtle deviations from the expected temporal symmetry on each switch during collective communications, all without any inter-switch coordination or probing overheads. Our preliminary results show that FlowPulse is effective in detecting silent faults in a variety of realistic settings, topologies and fault patterns. For example, FlowPulse identifies a single faulty link with a 1.5% corruption rate by checking temporal symmetry in a full two-level fat tree topology with 32 leaf switches while performing Ring-AllReduce on all nodes.

Original language: English
Title of host publication: HotNets 2025 - Proceedings of the 2025 24th ACM Workshop on Hot Topics in Networks
Publisher: Association for Computing Machinery, Inc
Pages: 139-148
Number of pages: 10
ISBN (Electronic): 9798400722806
State: Published - 17 Nov 2025
Event: 24th ACM Workshop on Hot Topics in Networks, HotNets 2025 - College Park, United States
Duration: 17 Nov 2025 to 18 Nov 2025

Publication series

Name: HotNets 2025 - Proceedings of the 2025 24th ACM Workshop on Hot Topics in Networks

Conference

Conference: 24th ACM Workshop on Hot Topics in Networks, HotNets 2025
Country/Territory: United States
City: College Park
Period: 17/11/25 to 18/11/25

Bibliographical note

Publisher Copyright:
© 2025 Copyright held by the owner/author(s).

Keywords

  • measurement and telemetry
  • network debugging
  • networks for machine learning

ASJC Scopus subject areas

  • Computer Networks and Communications
