Table of Contents
Fetching ...

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, Chuan Wu

TL;DR

ByteCheckpoint tackles the challenge of efficient, scalable checkpointing for large foundation model development by introducing a unified, parallelism-agnostic representation and a generic workflow that supports multiple training frameworks and storage backends. It enables load-time checkpoint resharding through a decoupled metadata model (ShardMeta, ByteMeta) and a coordinated planning/execution engine, coupled with full-stack I/O optimizations, asynchronous pipelines, and advanced monitoring. The system delivers dramatic improvements over baselines, reducing checkpoint stalls by up to $161.50\times$, speeding saving/loading by up to $9.96\times$/$8.80\times$, and achieving 3.88x faster loading on average, while scaling to thousands of GPUs in production. These gains translate into lower end-to-end training downtime and more flexible resource utilization, making ByteCheckpoint a practical solution for real-world LFM development; ETTR improvements of $1.16$–$1.29\times$ illustrate substantial end-to-end efficiency gains. $ETTR$ is defined as the ratio of productive training time to wallclock time, and ByteCheckpoint reduces wasted time via load-time resharding and asynchronous I/O, driving improvements in end-to-end training efficiency.

Abstract

Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallelism configurations. In addition, saved checkpoints are dispatched to evaluation tasks or transferred across different training stages (e.g., from pre-training to post-training). All these scenarios require resharding distributed checkpoints from one parallelism to another. In production environments, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales. A high-performance checkpointing system is needed to enable efficient checkpoint management at scale throughout the lifecycle of LFM development. We introduce ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint features: a parallelism-agnostic checkpoint representation that enables efficient load-time checkpoint resharding; a generic checkpoint saving/loading workflow to accommodate multiple training frameworks and support different storage backends; full-stack optimizations to ensure high I/O efficiency and scalability; a suite of monitoring tools to streamline large-scale performance analysis and bottleneck detection. Compared to existing open-source checkpointing systems [52, 58], ByteCheckpoint significantly reduces runtime checkpoint stalls, achieving an average reduction of 54.20x. For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

TL;DR

ByteCheckpoint tackles the challenge of efficient, scalable checkpointing for large foundation model development by introducing a unified, parallelism-agnostic representation and a generic workflow that supports multiple training frameworks and storage backends. It enables load-time checkpoint resharding through a decoupled metadata model (ShardMeta, ByteMeta) and a coordinated planning/execution engine, coupled with full-stack I/O optimizations, asynchronous pipelines, and advanced monitoring. The system delivers dramatic improvements over baselines, reducing checkpoint stalls by up to , speeding saving/loading by up to /, and achieving 3.88x faster loading on average, while scaling to thousands of GPUs in production. These gains translate into lower end-to-end training downtime and more flexible resource utilization, making ByteCheckpoint a practical solution for real-world LFM development; ETTR improvements of illustrate substantial end-to-end efficiency gains. is defined as the ratio of productive training time to wallclock time, and ByteCheckpoint reduces wasted time via load-time resharding and asynchronous I/O, driving improvements in end-to-end training efficiency.

Abstract

Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallelism configurations. In addition, saved checkpoints are dispatched to evaluation tasks or transferred across different training stages (e.g., from pre-training to post-training). All these scenarios require resharding distributed checkpoints from one parallelism to another. In production environments, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales. A high-performance checkpointing system is needed to enable efficient checkpoint management at scale throughout the lifecycle of LFM development. We introduce ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint features: a parallelism-agnostic checkpoint representation that enables efficient load-time checkpoint resharding; a generic checkpoint saving/loading workflow to accommodate multiple training frameworks and support different storage backends; full-stack optimizations to ensure high I/O efficiency and scalability; a suite of monitoring tools to streamline large-scale performance analysis and bottleneck detection. Compared to existing open-source checkpointing systems [52, 58], ByteCheckpoint significantly reduces runtime checkpoint stalls, achieving an average reduction of 54.20x. For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.
Paper Structure (30 sections, 2 equations, 17 figures, 9 tables)

This paper contains 30 sections, 2 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: An overview of the training pipeline of LFM.
  • Figure 2: Checkpoint resharding scenarios in LFM training. We only show GPU states for clarity of the figure.
  • Figure 3: Checkpointing efficiency impacts failure recovery and evaluation tasks. D2H denotes the Device-to-Host copy.
  • Figure 4: Architecture of ByteCheckpoint.
  • Figure 5: Examples of using ByteCheckpoint's APIs.
  • ...and 12 more figures