SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training
Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, Wen-mei Hwu
TL;DR
This work tackles the activation-memory bottleneck in large language model training by offloading activations to NVMe SSDs and overlapping I/O with GPU computation. The proposed SSDTrain framework uses a tensor cache, PyTorch hooks, and asynchronous GPU–SSD data paths to manage activations with minimal overhead, achieving up to a 47% reduction in activation peak memory. Key contributions include deduplicated, forward/backward-aware offloading, direct GPU–SSD transfers via GDS, and compatibility with major training stacks like Megatron-DeepSpeed. The results indicate substantial memory savings with negligible throughput loss and show how reduced activation memory can allow larger micro-batches and fewer pipeline bubbles, making LLM training more scalable.
Abstract
GPU memory capacity has not kept pace with the growth in the size of large language models (LLMs), which hinders model training. In particular, activations (the intermediate tensors produced during forward propagation and reused in backward propagation) dominate GPU memory use. This forces small micro-batch sizes, which in turn raise training overhead, for example through a higher weight update cost. To address this challenge, we propose SSDTrain, an adaptive framework that offloads activations to high-capacity NVMe SSDs. SSDTrain reduces GPU memory usage without impacting performance by fully overlapping data transfers with computation. It is compatible with popular deep learning frameworks such as PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication and forwarding to further improve efficiency. We experimented extensively with popular LLMs such as GPT, BERT, and T5. The results demonstrate that SSDTrain reduces activation peak memory usage by 47%. Meanwhile, SSDTrain fully overlaps I/O with computation and incurs negligible overhead. Compared with keeping activations in GPU memory and with layerwise full recomputation, SSDTrain achieves the best memory savings with negligible throughput loss. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing the micro-batch size and reducing pipeline parallelism bubbles.
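To make the offloading idea concrete, the following is a minimal, standard-library-only sketch of a deduplicating activation cache with asynchronous writes. It is not the SSDTrain implementation. The class `ActivationCache`, its methods, and the string keys are hypothetical; `bytes` objects stand in for GPU tensors and a temporary directory stands in for the NVMe SSD. "Forwarding" is interpreted here as serving a tensor from memory when it is requested before its write has completed, which is an assumption about the term rather than a definition taken from the abstract.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


class ActivationCache:
    """Toy activation cache (hypothetical); bytes objects stand in for tensors."""

    def __init__(self, directory):
        self.directory = directory
        self.pool = ThreadPoolExecutor(max_workers=2)  # background I/O workers
        self.lock = threading.Lock()
        self.pending = {}   # key -> (data, future) for writes still in flight
        self.on_disk = {}   # key -> path of a completed write
        self.writes = 0     # number of offloads actually issued

    def pack(self, key, data):
        """Forward pass: schedule an offload and return a lightweight handle."""
        with self.lock:
            if key in self.pending or key in self.on_disk:
                return key  # deduplication: this activation is already handled
            self.writes += 1
            path = os.path.join(self.directory, f"{key}.bin")
            future = self.pool.submit(self._write, key, path, data)
            self.pending[key] = (data, future)
        return key

    def _write(self, key, path, data):
        with open(path, "wb") as f:
            f.write(data)
        with self.lock:
            self.on_disk[key] = path
            self.pending.pop(key, None)  # release the in-memory copy

    def unpack(self, key):
        """Backward pass: return the activation that a handle refers to."""
        with self.lock:
            entry = self.pending.get(key)
            if entry is not None:
                return entry[0]  # forwarding: write in flight, reuse memory copy
            path = self.on_disk[key]
        with open(path, "rb") as f:
            return f.read()

    def drain(self):
        """Block until every outstanding write has finished."""
        with self.lock:
            futures = [future for _, future in self.pending.values()]
        for future in futures:
            future.result()


with tempfile.TemporaryDirectory() as tmp:
    cache = ActivationCache(tmp)
    act = bytes(range(256)) * 1024        # hypothetical 256 KiB activation
    h1 = cache.pack("layer0.out", act)    # saved by one operator
    h2 = cache.pack("layer0.out", act)    # saved again by another operator
    early = cache.unpack(h1)              # forwarded from memory or read from disk
    cache.drain()
    late = cache.unpack(h2)               # read back from the file
    cache.pool.shutdown()
```

In this sketch the second `pack` call issues no additional write, and `unpack` returns the same bytes whether or not the write has completed. A real system would derive the key from tensor identity, call `pack` and `unpack` from framework hooks, and move data over a direct GPU–SSD path rather than through Python file I/O.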
