STAlloc: Enhancing Memory Efficiency in Large-Scale Model Training with Spatio-Temporal Planning

Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang

TL;DR

STAlloc tackles GPU memory fragmentation in large-scale LLM training by combining offline spatio-temporal memory planning with online dynamic allocation. It exploits spatial and temporal regularities in memory requests to produce near-optimal Static Allocation Plans, and uses Dynamic Reusable Space to accommodate MoE and other dynamic requests, all within a PyTorch PluggableAllocator. Empirical results show an average fragmentation reduction of $85.1\%$ (up to $100\%$) and memory savings of up to $56.3$ GB, with negligible end-to-end overhead. The recovered memory enables higher-throughput training configurations, improving throughput by up to $32.5\%$. The approach applies to both dense and MoE models across diverse hardware and frameworks, pointing to a scalable path toward more memory-efficient large-scale model training.

Abstract

The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Such fragmentation stems from the use of online GPU memory allocators in popular deep learning frameworks like PyTorch, which disregard tensor lifespans. As a result, this inefficiency can waste as much as 43% of memory and trigger out-of-memory errors, undermining the effectiveness of optimization methods. To address this, we introduce STAlloc, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STAlloc introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch memory allocator, STAlloc reduces fragmentation ratio on average by 85.1% (up to 100%) across both dense and MoE models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves throughput performance by up to 32.5%.
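The contrast the abstract draws between a lifespan-agnostic online allocator and an offline plan can be illustrated with a toy simulation. The sketch below is not STAlloc's planner: the `Tensor` record, the trace, the first-fit policy, and the "latest-freed first" placement heuristic are all assumptions made for illustration, and the fragmentation ratio is taken here to mean `1 - peak_live / peak_reserved`, which may differ from the paper's exact definition.

```python
from typing import Iterable, NamedTuple


class Tensor(NamedTuple):
    """Hypothetical allocation record: alive during the half-open interval [start, end)."""
    name: str
    size: int
    start: int
    end: int


def first_fit(blocks: Iterable[tuple[int, int]], size: int) -> int:
    """Return the lowest offset where `size` fits without touching any (offset, size) block."""
    cursor = 0
    for off, sz in sorted(blocks):
        if off - cursor >= size:
            return cursor
        cursor = max(cursor, off + sz)
    return cursor


def peak_live(tensors: list[Tensor]) -> int:
    """Largest total size of simultaneously live tensors (a lower bound for any allocator)."""
    return max(
        sum(t.size for t in tensors if t.start <= now < t.end)
        for now in {t.start for t in tensors}
    )


def online_peak(tensors: list[Tensor]) -> int:
    """Replay the trace with first-fit and no lifespan knowledge; return the high-water mark."""
    events = []
    for t in tensors:
        events.append((t.start, 1, t))  # 1 = allocate
        events.append((t.end, 0, t))    # 0 = free (processed before allocations at the same time)
    events.sort(key=lambda e: (e[0], e[1]))
    live: dict[str, tuple[int, int]] = {}
    peak = 0
    for _, is_alloc, t in events:
        if not is_alloc:
            del live[t.name]
            continue
        off = first_fit(live.values(), t.size)
        live[t.name] = (off, t.size)
        peak = max(peak, off + t.size)
    return peak


def planned_peak(tensors: list[Tensor]) -> int:
    """Offline greedy plan: place tensors that are freed last first, so long-lived ones cluster."""
    placed: list[tuple[Tensor, int]] = []
    peak = 0
    for t in sorted(tensors, key=lambda x: (-x.end, x.start)):
        # Only tensors whose lifespans overlap with `t` constrain its address range.
        conflicts = [(off, p.size) for p, off in placed
                     if p.start < t.end and t.start < p.end]
        off = first_fit(conflicts, t.size)
        placed.append((t, off))
        peak = max(peak, off + t.size)
    return peak


# Interleaved trace: short-lived tensors (A*) alternate with long-lived ones (B*),
# then a larger tensor C is requested after the A* tensors are freed.
TRACE = [
    Tensor("A1", 4, 0, 6), Tensor("B1", 1, 1, 8),
    Tensor("A2", 4, 2, 6), Tensor("B2", 1, 3, 8),
    Tensor("A3", 4, 4, 6), Tensor("B3", 1, 5, 8),
    Tensor("C", 8, 6, 8),
]

if __name__ == "__main__":
    live = peak_live(TRACE)
    for label, reserved in (("online", online_peak(TRACE)), ("planned", planned_peak(TRACE))):
        print(f"{label}: reserved={reserved}, fragmentation={1 - live / reserved:.3f}")
```

On this trace the online replay reserves 23 units against a live peak of 15, because the long-lived `B*` tensors pin the holes left by the freed `A*` tensors and `C` fits in none of them. The lifespan-aware plan reserves exactly 15. A real workload would need the dynamic path the abstract describes for requests whose sizes or lifespans are not known ahead of time.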

Paper Structure

This paper contains 25 sections, 9 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: (a) Memory fragmentation in interleaved allocation. (b) Memory and training throughput of different training configurations for Llama2-7B on 8 NVIDIA A800 GPUs.
  • Figure 2: Comparison of PyTorch memory efficiency with no optimizations, recomputation, and Virtual Pipeline.
  • Figure 3: Allocation size distribution during training. As shown in the figure, there are only around 32 distinct tensor sizes among different training configurations.
  • Figure 4: Allocation classification based on temporal characteristics. The temporal characteristics of activation tensors are influenced by training optimization techniques.
  • Figure 5: Workflow of STAlloc.
  • ...and 8 more figures