GoCkpt: Gradient-Assisted Multi-Step overlapped Checkpointing for Efficient LLM Training
Keyao Zhang, Yiquan Chen, Zhuo Hu, Wenhai Lin, Jiexiong Xu, Wenzhi Chen
TL;DR
GoCkpt tackles the bottleneck of GPU interruption during checkpointing in LLM training by overlapping checkpoint transfers with multiple training steps and reconstructing a consistent CPU-side checkpoint. The approach combines cross-step transfer, gradient data movement, CPU-based parameter updates, and IO bandwidth optimization to hide the checkpoint overhead within the training loop. Empirical results show up to 38.4% throughput improvement over asynchronous checkpointing and up to 86.7% reduction in training interruption time, with near-ideal performance in multi-GPU setups. This work offers a practical pathway to higher utilization and faster training of ever-larger LLMs by rethinking checkpointing as a distributed, overlapped, and CPU-assisted process.
Abstract
The accuracy of large language models (LLMs) improves with model size, but growing model complexity also poses significant challenges to training stability. Periodic checkpointing is a key mechanism for fault recovery and is widely used in LLM training. However, traditional checkpointing strategies often pause or delay GPU computation while checkpoint state is transferred from GPU to CPU, causing significant training interruptions and reduced training throughput. To address this issue, we propose GoCkpt, a method that overlaps checkpoint saving with multiple training steps and reconstructs the final checkpoint on the CPU. We spread the checkpoint transfer across multiple steps: each step transfers part of the checkpoint state, along with some of the gradient data used for parameter updates. After the transfer is complete, each partial checkpoint state is updated to a consistent version on the CPU, which avoids the state inconsistency that would otherwise arise from transferring a checkpoint across multiple steps. Furthermore, we introduce a transfer optimization strategy to maximize GPU-CPU bandwidth utilization and SSD persistence throughput. This dual optimization (overlapping saves across steps and maximizing I/O efficiency) significantly reduces the training time lost to checkpointing. Experimental results show that GoCkpt can increase training throughput by up to 38.4% compared to traditional asynchronous checkpointing solutions used in industry. We also find that GoCkpt can reduce training interruption time by 86.7% compared to state-of-the-art checkpoint transfer methods, which results in a 4.8% throughput improvement.
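The cross-step idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: it uses plain SGD instead of a real LLM optimizer, transfers exactly one shard per training step, and models "GPU" and "CPU" memory as ordinary Python lists. The names `sgd_update`, `fake_gradient`, and `train_with_overlapped_checkpoint` are hypothetical.

```python
import random


def sgd_update(shard, grad, lr):
    """Apply one plain-SGD step to a shard and return the new shard."""
    return [p - lr * g for p, g in zip(shard, grad)]


def fake_gradient(shard, seed):
    """Hypothetical stand-in for backpropagation: a deterministic pseudo-gradient."""
    rng = random.Random(seed)
    return [0.1 * p + rng.uniform(-1.0, 1.0) for p in shard]


def train_with_overlapped_checkpoint(gpu_shards, lr=0.01):
    """Run one training step per shard while copying the checkpoint piecewise.

    Assumed schedule: at step s, shard s is copied to the CPU before that
    step's update. For every shard already copied, the gradient of each
    later step is copied too, so the CPU can replay the missed updates.
    """
    num_steps = len(gpu_shards)
    cpu_shards = {}  # shard index -> stale copy taken at its transfer step
    cpu_grads = {}   # shard index -> gradients recorded since that copy

    for step in range(num_steps):
        # Transfer one slice of the checkpoint state during this step.
        cpu_shards[step] = list(gpu_shards[step])
        cpu_grads[step] = []

        grads = [fake_gradient(shard, step * 1000 + i)
                 for i, shard in enumerate(gpu_shards)]

        # Transfer the gradients needed to bring copied shards up to date.
        for j in cpu_shards:
            cpu_grads[j].append(list(grads[j]))

        # Training itself continues on every shard without pausing.
        gpu_shards = [sgd_update(s, g, lr) for s, g in zip(gpu_shards, grads)]

    # CPU-side reconstruction: replay the recorded updates on each stale copy.
    checkpoint = []
    for j in range(num_steps):
        shard = cpu_shards[j]
        for grad in cpu_grads[j]:
            shard = sgd_update(shard, grad, lr)
        checkpoint.append(shard)

    return gpu_shards, checkpoint


if __name__ == "__main__":
    initial = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    gpu_state, checkpoint = train_with_overlapped_checkpoint(initial)
    print("consistent:", checkpoint == gpu_state)
```

In this toy setting, each shard is copied at a different model version, yet the replayed CPU copy matches the training state at the end of the transfer window, because the CPU applies the same updates to the same values in the same order. A real system would also have to handle optimizer state such as Adam moments, mixed precision, and sharding across GPUs, none of which is modeled here.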
