GoCkpt: Gradient-Assisted Multi-Step overlapped Checkpointing for Efficient LLM Training

Keyao Zhang, Yiquan Chen, Zhuo Hu, Wenhai Lin, Jiexiong Xu, Wenzhi Chen

TL;DR

GoCkpt tackles the bottleneck of GPU interruption during checkpointing in LLM training by overlapping checkpoint transfers with multiple training steps and reconstructing a consistent CPU-side checkpoint. The approach combines cross-step transfer, gradient data movement, CPU-based parameter updates, and I/O bandwidth optimization to hide the checkpoint overhead within the training loop. Empirical results show up to 38.4% throughput improvement over asynchronous checkpointing and up to 86.7% reduction in training interruption time, with near-ideal performance in multi-GPU setups. This work offers a practical pathway to higher utilization and faster training of ever-larger LLMs by rethinking checkpointing as a distributed, overlapped, and CPU-assisted process.

Abstract

The accuracy of large language models (LLMs) improves with increasing model size, but increasing model complexity also poses significant challenges to training stability. Periodic checkpointing is a key mechanism for fault recovery and is widely used in LLM training. However, traditional checkpointing strategies often pause or delay GPU computation while the checkpoint is transferred from GPU to CPU, resulting in significant training interruptions and reduced training throughput. To address this issue, we propose GoCkpt, a method that overlaps checkpoint saving with multiple training steps and reconstructs the final checkpoint on the CPU. We transfer the checkpoint across multiple steps: each step transfers part of the checkpoint state, along with some of the gradient data used for parameter updates. After the transfer is complete, each partial checkpoint state is updated to a consistent version on the CPU, which avoids the state inconsistency that would otherwise arise from transferring a checkpoint across multiple steps. Furthermore, we introduce a transfer optimization strategy to maximize GPU-CPU bandwidth utilization and SSD persistence throughput. This dual optimization, overlapping saves across steps and maximizing I/O efficiency, significantly reduces invalid training time. Experimental results show that GoCkpt can increase training throughput by up to 38.4% compared to traditional asynchronous checkpoint solutions in the industry. We also find that GoCkpt can reduce training interruption time by 86.7% compared to state-of-the-art checkpoint transfer methods, which results in a 4.8% throughput improvement.
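
To make the multi-step idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a plain SGD update and one shard transferred per training step. The names `sgd_step` and `run_multistep_checkpoint` are hypothetical, and Python lists stand in for GPU and CPU memory. Each step copies one shard to the "CPU" at its current version, and shards copied earlier are brought forward by replaying the same update on the CPU with that step's gradients.

```python
import random

def sgd_step(params, grads, lr):
    """Plain SGD update. Assumption: stands in for the real optimizer."""
    return [p - lr * g for p, g in zip(params, grads)]

def run_multistep_checkpoint(num_shards=4, shard_size=3, lr=0.1, seed=0):
    """Spread one checkpoint over `num_shards` training steps.

    Returns (gpu_shards, cpu_shards) after the last step. If the scheme is
    consistent, the CPU copy matches the GPU state at that final step.
    """
    rng = random.Random(seed)
    # "GPU" state: the model split into shards of parameters.
    gpu_shards = [[rng.gauss(0, 1) for _ in range(shard_size)]
                  for _ in range(num_shards)]
    cpu_shards = [None] * num_shards

    for step in range(num_shards):
        # Training step on the GPU: random numbers stand in for real gradients.
        grads = [[rng.gauss(0, 1) for _ in range(shard_size)]
                 for _ in range(num_shards)]
        gpu_shards = [sgd_step(p, g, lr) for p, g in zip(gpu_shards, grads)]

        # Shards copied in earlier steps are stale by one update, so the
        # gradients for those shards are sent too and applied on the CPU.
        for i in range(step):
            cpu_shards[i] = sgd_step(cpu_shards[i], grads[i], lr)

        # Copy this step's shard at its current (post-update) version.
        cpu_shards[step] = list(gpu_shards[step])

    return gpu_shards, cpu_shards

if __name__ == "__main__":
    gpu, cpu = run_multistep_checkpoint()
    print("CPU checkpoint consistent with GPU state:", cpu == gpu)
```

In this toy setting the CPU copy ends up identical to the GPU state at the final step, because every shard receives exactly the updates it missed after being copied. A stateful optimizer such as Adam would also need its optimizer state in each shard, which the sketch omits.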

Paper Structure

This paper contains 35 sections, 3 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Process of LLM training (M+O: model and optimizer parameters; G: gradients; N: model and optimizer version at step N)
  • Figure 2: Traditional single-step snapshot (a) and GoCkpt multi-step snapshot (b) overview
  • Figure 3: Compute-transfer overlap and CPU-assisted updates
  • Figure 4: Computation and data transfer flow of various checkpointing schemes. F: forward; B: backward; U: update; T: time of a single step; M+O: GPU transfers model and optimizer parameters to CPU; G: GPU transfers gradients to CPU
  • Figure 5: Checkpoint throughput and stall time for different checkpoint frequency settings (higher is better)
  • ...and 5 more figures