Table of Contents
Fetching ...

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

TL;DR

Checkpointing large language models at scale imposes substantial I/O overheads that can stall training. DataStates-LLM introduces lazy asynchronous multi-level checkpointing that exploits the immutability of model parameters and optimizer state during forward and backward passes, enabling background copies and streaming flushes to storage. The approach includes preallocated, pinned host buffers, shard coalescing, and hierarchical asynchronous consolidation integrated as a DeepSpeed/Megatron-LM extension, with extensive evaluation on up to 180 GPUs showing major throughput and end-to-end training improvements. This work significantly improves resilience and efficiency of LLM training on HPC systems, enabling higher-frequency checkpointing with reduced overheads and faster job completion.

Abstract

LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., failures of components, instability of the software, undesirable learning patterns, etc.), are frequent and typically impact the training in a negative fashion. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system), incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads for enabling fast and scalable checkpointing for LLMs that can be applied at high frequency (up to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48$\times$ faster checkpointing and 2.2$\times$ faster end-to-end training runtime compared with the state-of-art checkpointing approaches.

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

TL;DR

Checkpointing large language models at scale imposes substantial I/O overheads that can stall training. DataStates-LLM introduces lazy asynchronous multi-level checkpointing that exploits the immutability of model parameters and optimizer state during forward and backward passes, enabling background copies and streaming flushes to storage. The approach includes preallocated, pinned host buffers, shard coalescing, and hierarchical asynchronous consolidation integrated as a DeepSpeed/Megatron-LM extension, with extensive evaluation on up to 180 GPUs showing major throughput and end-to-end training improvements. This work significantly improves resilience and efficiency of LLM training on HPC systems, enabling higher-frequency checkpointing with reduced overheads and faster job completion.

Abstract

LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., failures of components, instability of the software, undesirable learning patterns, etc.), are frequent and typically impact the training in a negative fashion. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system), incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads for enabling fast and scalable checkpointing for LLMs that can be applied at high frequency (up to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48 faster checkpointing and 2.2 faster end-to-end training runtime compared with the state-of-art checkpointing approaches.
Paper Structure (46 sections, 12 figures, 1 table)

This paper contains 46 sections, 12 figures, 1 table.

Figures (12)

  • Figure 1: Data, pipeline, and tensor parallel runtime training. Compute node configuration consisting of four A100-40GB GPUs.
  • Figure 2: Sharding of checkpoints during training of conventional DNNs and LLMs for different degrees of pipeline (PP), tensor (TP), and data (DP) parallelism.
  • Figure 3: Aggregate checkpoint sizes of different model sizes and average checkpoint size per GPU.
  • Figure 4: Different iteration phases. Model and optimizer states are immutable during forward and backward passes.
  • Figure 5: Overlapping LLM training with checkpointing using different approaches.
  • ...and 7 more figures