SHIFT: An RDMA Failure-Resilient Layer for Distributed Training

Shengkai Lin, Kairui Zhou, Yibo Wu, Hongtao Zhang, Qinwei Yang, Wei Zhang, Arvind Krishnamurthy, Shizhen Zhao

TL;DR

This work addresses the failure-prone nature of large-scale distributed training by introducing SHIFT, a fault-resilient layer that operates at the RDMA layer to tolerate network anomalies without disrupting application progress. SHIFT redirects RDMA traffic across intra-host NICs using mechanisms such as out-of-band KV attribute transfer, inline work queue rewind, and CQ event-based receiver notification, all while remaining application-agnostic. The paper presents a design and implementation (SHIFTLib) with a dual-phase execution model and a four-state send/receive QP workflow, and evaluates it with microbenchmarks and PyTorch workloads, reporting minimal data-path overhead and continued execution under network failures. Combined with existing application-layer checkpointing, this RDMA-layer resilience can reduce wasted computation and training time in failure-prone, large-scale environments. Overall, SHIFT aims to improve the robustness of distributed AI training across diverse network topologies and failure modes.

Abstract

With gang scheduling in large-scale distributed Large Language Model training, a single network anomaly can propagate and cause complete task failure. The frequency of such anomalies increases with network scale. However, existing fault-tolerance mechanisms, such as checkpointing and runtime resilience methods, primarily operate at the application layer and inevitably cause disruptions in training progress. We propose to address this challenge by introducing fault tolerance at the Remote Direct Memory Access (RDMA) layer and integrating it with existing application-layer techniques. We present SHIFT, a fault-resilient layer over RDMA that enables seamless redirection of RDMA traffic across different intra-host NICs. By allowing applications to continue execution in the presence of network anomalies until the next checkpoint, SHIFT effectively minimizes training progress loss. SHIFT is designed to be application-agnostic, transparent to applications, and low-overhead. Through a carefully designed failure state machine and control flow, unmodified applications such as PyTorch with NCCL can run with RDMA-level fault tolerance. Experimental results demonstrate that SHIFT introduces minimal data path overhead while ensuring application continuity under network failures.

Paper Structure

This paper contains 23 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Comparison of failure-handling technologies for handling network anomalies.
  • Figure 2: Illustration of GPU cluster networks with different inter-host and intra-host topologies. Fig. (a) depicts a traditional fat-tree network with a DGX A100 architecture [a100aegismetascale], while Fig. (b) shows a rail-optimized [ncclall2all] network with a DGX H100 architecture [h100insightsdeepseek]. While specific examples are shown, these inter-host and intra-host topologies can be combined in various ways. For simplicity, Fig. (b) omits that a DGX H100 connects four PCIe switches to a single CPU. Even in topologies such as H100, intra-host RNICs remain accessible via the CPUs, although with reduced performance.
  • Figure 3: Typical RDMA workflow. The red arrows and blue arrows represent the control flow and the green arrows represent the data flow. QP and CQ are actually metadata memory regions allocated when creating QPs and CQs.
  • Figure 4: Architecture of SHIFT. The blue boxes and the green boxes denote the resources managed by the application and SHIFTLib, respectively. The black arrows and the red arrows denote the data flows and the control flows, respectively.
  • Figure 5: Brief overview of the SHIFT state machine. The indices label the steps taken during the state transitions. Green arrows and boxes denote SHIFT's control flow, work requests (WRs), and work completions (WCs).
  • ...and 5 more figures