SHIFT: An RDMA Failure-Resilient Layer for Distributed Training
Shengkai Lin, Kairui Zhou, Yibo Wu, Hongtao Zhang, Qinwei Yang, Wei Zhang, Arvind Krishnamurthy, Shizhen Zhao
TL;DR
Large-scale distributed training is failure-prone. This work introduces SHIFT, a fault-resilient layer over RDMA that tolerates network anomalies without disrupting application progress. SHIFT redirects RDMA traffic across intra-host NICs using mechanisms such as out-of-band KV attribute transfer, inline work queue rewind, and CQ event-based receiver notification, all while remaining application-agnostic. The paper presents a practical design and implementation (SHIFTLib) with a dual-phase execution model and a four-state send/receive QP workflow, and evaluates it with microbenchmarks and PyTorch-based workloads, showing minimal data-path overhead and robust fault tolerance. Combined with existing application-layer checkpointing, SHIFT can reduce wasted computation and training time in failure-prone, large-scale environments. The goal is a practical, scalable way to improve the robustness of distributed AI training across diverse network topologies and failure modes.
Abstract
With gang scheduling in large-scale distributed Large Language Model training, a single network anomaly can propagate and cause complete task failure. The frequency of such anomalies increases with network scale. However, existing fault-tolerance mechanisms, such as checkpointing and runtime resilience methods, primarily operate at the application layer and inevitably cause disruptions in training progress. We propose to address this challenge by introducing fault tolerance at the Remote Direct Memory Access (RDMA) layer and integrating it with existing application-layer techniques. We present SHIFT, a fault-resilient layer over RDMA that enables seamless redirection of RDMA traffic across different intra-host NICs. By allowing applications to continue execution in the presence of network anomalies until the next checkpoint, SHIFT effectively minimizes training progress loss. SHIFT is designed to be application-agnostic, transparent to applications, and low-overhead. Through a carefully designed failure state machine and control flow, unmodified applications such as PyTorch with NCCL can run with RDMA-level fault tolerance. Experimental results demonstrate that SHIFT introduces minimal data path overhead while ensuring application continuity under network failures.
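To make the idea of redirecting traffic across intra-host NICs concrete, the following is a minimal toy sketch in Python. It is not SHIFT's implementation and does not use any real RDMA verbs API: `ToyNic`, `FailoverSender`, and their methods are hypothetical names introduced only for illustration. It assumes the sender keeps every posted-but-uncompleted request, so that after a fault the first uncompleted request can be replayed on a backup NIC while the caller keeps posting as usual.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyNic:
    """Hypothetical stand-in for a NIC; not a real RDMA verbs object."""
    name: str
    healthy: bool = True
    delivered: list = field(default_factory=list)

    def send(self, seq: int, payload: str) -> bool:
        # True means the request completed; False models a link fault.
        if not self.healthy:
            return False
        self.delivered.append((seq, payload))
        return True


class FailoverSender:
    """Keeps uncompleted requests so they can be replayed after a fault."""

    def __init__(self, primary: ToyNic, backup: ToyNic):
        self.nics = [primary, backup]
        self.active = 0          # index of the NIC currently in use
        self.pending = deque()   # (seq, payload) posted but not completed
        self.next_seq = 0

    def post(self, payload: str) -> None:
        # The caller posts work without knowing which NIC will carry it.
        self.pending.append((self.next_seq, payload))
        self.next_seq += 1

    def progress(self, limit=None) -> int:
        """Complete up to `limit` requests; return how many completed."""
        done = 0
        while self.pending and (limit is None or done < limit):
            seq, payload = self.pending[0]
            if self.nics[self.active].send(seq, payload):
                self.pending.popleft()
                done += 1
            elif self.active + 1 < len(self.nics):
                # Redirect to the next NIC. The failed request stays at the
                # head of the queue, so it is replayed there.
                self.active += 1
            else:
                raise RuntimeError("no healthy NIC left")
        return done


primary, backup = ToyNic("nic0"), ToyNic("nic1")
sender = FailoverSender(primary, backup)
for chunk in ["a", "b", "c", "d"]:
    sender.post(chunk)

sender.progress(limit=2)   # two requests complete on the primary NIC
primary.healthy = False    # inject a fault
sender.progress()          # remaining requests are replayed on the backup
```

In this toy model the caller never observes the fault: requests that had not completed on `nic0` are delivered in order on `nic1`. A real RDMA layer would also have to handle partially transferred messages, receiver-side state, and connection metadata, none of which are modeled here.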
