Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

Yuxin Wang; Xueze Kang; Shaohuai Shi; Xin He; Zhenheng Tang; Xinglin Pan; Yang Zheng; Xiaoyu Wu; Amelie Chi Zhou; Bingsheng He; Xiaowen Chu

Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu

TL;DR

This work addresses fault-tolerant training at scale for hybrid-parallel LM pretraining, where traditional in-memory checkpointing incurs prohibitive overhead. It introduces REFT, a two-part framework with REFT-save for in-memory checkpoint saving and REFT-load for in-memory loading, augmented by Hierarchical Asynchronous Snapshotting (HAS) and intra-node redundancy techniques (ARC, AEC, AOR) to minimize interference with training and ensure robust recovery. The main contributions are the design of HAS for near-zero snapshotting overhead, multiple complementary protection schemes tailored for HP (ARC, AEC, AOR), and an elastic, NFS-light workflow that enables fast restarts and scalable loading. Evaluations on Llama-2 models across NVIDIA Cluster and Frontier show substantial gains: HAS reduces iteration-time overhead by over 17% with near-zero overhead, ARC alone achieves about 1% overhead, and REFT-load provides up to 12.36x faster in-memory loading than NFS, enabling practical, large-scale fault-tolerant HP LM training with improved survival probabilities and reduced recomputation. Overall, REFT demonstrates that carefully co-designed in-memory checkpointing can deliver fast, reliable fault tolerance at scale for HP LM training, with direct applicability to industrial training pipelines.

Abstract

To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition between checkpointing and training, which can work under DP but can hardly scale under resource-intensive HP. To ensure low checkpointing overhead for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It strives from two aspects to mitigate the on-host resource competition caused by in-memory checkpointing: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This approach uses three-level asynchronous on-device scheduling to enhance parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to enhance checkpoint completeness during hardware failures. Unlike methods that require inter-node communications, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. With these methods, this work enables fast restart for failed HP training with Distributed In-memory Checkpoint Loading, bypassing inefficiencies in NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).

Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

TL;DR

Abstract

Paper Structure (43 sections, 6 equations, 12 figures, 3 tables, 1 algorithm)

This paper contains 43 sections, 6 equations, 12 figures, 3 tables, 1 algorithm.

Introduction
Preliminaries and Motivations
Parallel Training Techniques for Large Models
Hybrid Parallel Training
Failures in Hybrid Parallelism
In-Memory Checkpointing for LM Training
Limitations and Motivations
Limitation in Existing Snapshotting
Limitation in Existing Snapshot Protecting
Design Overview
REFT: Reliable and Efficient In-memory Fault Tolerance
Distributed Snapshotting with REFT-save
Global Parameter Sharding
Hierarchical Asynchronous Snapshotting (HAS)
Distributed In-memory Protecting with REFT-save
...and 28 more sections

Figures (12)

Figure 1: Existing in-memory checkpoint snapshotting and protecting methods introduce severe overhead to LM training. Details of optimizations will be discussed in Section \ref{['sec:back']}.
Figure 2: Resource competition between training and snapshotting
Figure 3: Asynchronous in-memory checkpoint overhead
Figure 4: System Overview: REFT-save employs two engines to snapshot and protect parameters in host memory, generating in-memory checkpoints. REFT-load utilizes Distributed In-memory Loading Engine for fast failure recovery.
Figure 5: An $SG$ (Sharding Group) refers to a group where the parameters of assigned partitions are saved. In this example, we use TP intra-node and PP inter-node, a common 3D parallel setting narayanan-sc21zhang2022opt. The model partition across TP and PP within the same DP differs. In 3D parallel pretraining, micro batches are fed into the LLM from the left side, passing through forward and backward passes. All nodes in the same PP state form an $SG$, e.g., all $PP_0$ nodes form $SG_0$.
...and 7 more figures

Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

TL;DR

Abstract

Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

Authors

TL;DR

Abstract

Table of Contents

Figures (12)