Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing
Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu
TL;DR
This work addresses fault-tolerant training at scale for hybrid-parallel LM pretraining, where traditional in-memory checkpointing incurs prohibitive overhead. It introduces REFT, a two-part framework with REFT-save for in-memory checkpoint saving and REFT-load for in-memory loading, augmented by Hierarchical Asynchronous Snapshotting (HAS) and intra-node redundancy techniques (ARC, AEC, AOR) to minimize interference with training and ensure robust recovery. The main contributions are the design of HAS for near-zero snapshotting overhead, multiple complementary protection schemes tailored for HP (ARC, AEC, AOR), and an elastic, NFS-light workflow that enables fast restarts and scalable loading. Evaluations on Llama-2 models across NVIDIA Cluster and Frontier show substantial gains: HAS reduces iteration-time overhead by over 17% with near-zero overhead, ARC alone achieves about 1% overhead, and REFT-load provides up to 12.36x faster in-memory loading than NFS, enabling practical, large-scale fault-tolerant HP LM training with improved survival probabilities and reduced recomputation. Overall, REFT demonstrates that carefully co-designed in-memory checkpointing can deliver fast, reliable fault tolerance at scale for HP LM training, with direct applicability to industrial training pipelines.
Abstract
To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition between checkpointing and training, which can work under DP but can hardly scale under resource-intensive HP. To ensure low checkpointing overhead for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It strives from two aspects to mitigate the on-host resource competition caused by in-memory checkpointing: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This approach uses three-level asynchronous on-device scheduling to enhance parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to enhance checkpoint completeness during hardware failures. Unlike methods that require inter-node communications, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. With these methods, this work enables fast restart for failed HP training with Distributed In-memory Checkpoint Loading, bypassing inefficiencies in NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).
