SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training

Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, Wen-mei Hwu

TL;DR

This work tackles the activation-memory bottleneck in large language model training by offloading activations to NVMe SSDs and overlapping I/O with GPU computation. The proposed SSDTrain framework uses a tensor cache, PyTorch hooks, and asynchronous GPU–SSD data paths to manage activations with minimal overhead, achieving up to a 47% reduction in activation peak memory. Key contributions include tensor deduplication and forwarding, direct GPU–SSD transfers via GDS, and compatibility with major training stacks like Megatron-DeepSpeed. The results indicate substantial memory savings with negligible throughput loss and show how the freed activation memory can allow larger micro-batches and smaller pipeline bubbles, making LLM training more scalable.

Abstract

GPU memory capacity has not grown as fast as the size of large language models (LLMs), which hinders the model training process. In particular, activations -- the intermediate tensors produced during forward propagation and reused in backward propagation -- dominate GPU memory use. This leads to high training overhead, such as a high weight update cost due to the small micro-batch size. To address this challenge, we propose SSDTrain, an adaptive activation offloading framework to high-capacity NVMe SSDs. SSDTrain reduces GPU memory usage without impacting performance by fully overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication and forwarding to further enhance efficiency. We experimented extensively with popular LLMs like GPT, BERT, and T5. Results demonstrate that SSDTrain reduces activation peak memory usage by 47%. Meanwhile, SSDTrain perfectly overlaps the I/O with the computation and incurs negligible overhead. Compared with keeping activations in GPU memory and with layerwise full recomputation, SSDTrain achieves the best memory savings with negligible throughput loss. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing micro-batch size and reducing pipeline parallelism bubbles.
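The tensor cache with deduplication mentioned above can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not SSDTrain's implementation: `FakeTensor`, `TensorCache`, `pack`, `unpack`, and `demo` are hypothetical names, plain byte strings stand in for GPU tensors, a temporary directory stands in for the NVMe SSD, and all I/O is synchronous, so the overlap of transfers with computation is not modeled. In PyTorch, hooks of this shape can be registered with `torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook)`.

```python
import itertools
import os
import tempfile

_next_uid = itertools.count()


class FakeTensor:
    """Hypothetical stand-in for an activation tensor (bytes instead of GPU memory)."""

    def __init__(self, data: bytes):
        self.uid = next(_next_uid)  # unique identity used as the cache key
        self.data = data


class TensorCache:
    """Toy activation cache: writes buffers to files and reads them back on demand."""

    def __init__(self, directory: str):
        self.directory = directory
        self.paths = {}   # uid -> path of the offloaded buffer
        self.writes = 0   # number of buffers actually written to storage

    def pack(self, tensor: FakeTensor) -> int:
        """Called when forward propagation saves a tensor; returns a small handle."""
        if tensor.uid not in self.paths:  # deduplication: write each tensor once
            path = os.path.join(self.directory, f"{tensor.uid}.bin")
            with open(path, "wb") as f:
                f.write(tensor.data)
            self.paths[tensor.uid] = path
            self.writes += 1
        return tensor.uid

    def unpack(self, handle: int) -> bytes:
        """Called when backward propagation needs the tensor; reloads it from storage."""
        with open(self.paths[handle], "rb") as f:
            return f.read()


def demo():
    with tempfile.TemporaryDirectory() as directory:
        cache = TensorCache(directory)
        activation = FakeTensor(b"\x01\x02\x03\x04")
        first = cache.pack(activation)    # saved by one operator
        second = cache.pack(activation)   # saved again by another operator
        restored = cache.unpack(first)
        return first == second, cache.writes, restored == activation.data
```

Calling `demo()` packs the same activation twice and reloads it once; the cache writes the buffer a single time and both pack calls return the same handle. A real system would issue the writes asynchronously and prefetch reloads ahead of the backward pass, which this sketch leaves out.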

Paper Structure

This paper contains 20 sections, 8 figures, 3 tables, and 1 algorithm.

Figures (8)

  • Figure 1: The growth of FP16 throughput of GPUs for deep learning training is aligned with the model size of LLMs, but GPU memory capacity falls behind (Epoch AI, 2023). Horizontal axis shows release date. Points represent both Nvidia 100-level GPUs since K100 and Google TPUs.
  • Figure 2: SSDTrain timeline of a step of a 2-microbatch 3-layer (L) model.
  • Figure 3: SSDTrain workflow. SSDTrain components are shown as blue blocks.
  • Figure 4: Tensor cache registers hooks to offload tensors and reload tensors. (a) shows the computational graph. (b) shows the hardware data path. (c) and (d) show the tensor cache state when the pack or unpack hook is triggered.
  • Figure 5: Estimate of SSD lifespan, PCIe write bandwidth, and maximum activation size per GPU. Lifespans longer than 5 years are shown on top of the pink bars. ZeRO3 stands for DeepSpeed with stage-3 ZeRO.
  • ...and 3 more figures