ProTrain: Efficient LLM Training via Memory-Aware Techniques

Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu

TL;DR

ProTrain addresses the memory bottlenecks in large-language-model training by coordinating memory, computation, and IO through fine-grained chunk-based state management and block-wise activation management, guided by a memory-aware runtime profiler. It preserves training accuracy while achieving substantial throughput gains over state-of-the-art systems and enabling training of models up to 70B parameters on a single A100. The approach combines an adaptive memory manager with a predictive runtime/memory estimator and an exhaustive yet pruned configuration search, demonstrating strong scalability across GPUs and model sizes. Overall, ProTrain significantly lowers the hardware barriers to large-scale LLM training, particularly for resource-constrained settings.

Abstract

Training Large Language Models (LLMs) is extremely memory-hungry. To address this, existing work such as ZeRO-Offload uses the CPU and GPU together during training. Such techniques largely democratize billion-scale model training, making it possible to train with a few consumer graphics cards. However, we observe that existing frameworks often provide only coarse-grained memory management and require experts to tune their configuration, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler, without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ compared to state-of-the-art (SOTA) training systems.
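The abstract describes configuration selection that happens without user intervention, guided by predicted memory use and runtime. The following is a minimal sketch of that general idea, not ProTrain's actual implementation: it enumerates candidate configurations, discards those whose predicted peak memory exceeds the budget, and keeps the one with the lowest predicted runtime. The two knobs (`gpu_chunks`, `ckpt_blocks`), the function names, and the toy cost models are all hypothetical and chosen only for illustration.

```python
from itertools import product


def search_config(n_chunks, n_blocks, mem_budget, est_memory, est_runtime):
    """Return ((gpu_chunks, ckpt_blocks), runtime) for the feasible
    configuration with the lowest predicted runtime, or None if none fits.

    Hypothetical knobs (assumptions, not taken from the paper):
      gpu_chunks  - number of model-state chunks kept resident on the GPU
      ckpt_blocks - number of blocks whose activations are recomputed
    """
    best = None
    for gpu_chunks, ckpt_blocks in product(range(n_chunks + 1),
                                           range(n_blocks + 1)):
        # Prune: skip configurations predicted to exceed the memory budget.
        if est_memory(gpu_chunks, ckpt_blocks) > mem_budget:
            continue
        runtime = est_runtime(gpu_chunks, ckpt_blocks)
        if best is None or runtime < best[1]:
            best = ((gpu_chunks, ckpt_blocks), runtime)
    return best


# Toy cost models with made-up constants, standing in for a real profiler.
N_CHUNKS, N_BLOCKS = 6, 4


def toy_memory(gpu_chunks, ckpt_blocks):
    # Resident chunks cost memory; checkpointed blocks keep fewer activations.
    return (2.0 * gpu_chunks
            + 3.0 * (N_BLOCKS - ckpt_blocks)
            + 1.0 * ckpt_blocks)


def toy_runtime(gpu_chunks, ckpt_blocks):
    # Offloaded chunks add transfer time; checkpointed blocks add recompute.
    return 10.0 + 1.5 * (N_CHUNKS - gpu_chunks) + 1.0 * ckpt_blocks


best = search_config(N_CHUNKS, N_BLOCKS, 16.0, toy_memory, toy_runtime)
print(best)
```

In this toy setting, recomputing activations frees enough memory to keep more model state on the GPU, which is the kind of memory/computation/IO trade-off the abstract refers to. A real system would replace the toy functions with profiled estimates.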

Paper Structure

This paper contains 43 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Key Chunk Operations in Chunk-Based Model State Management.
  • Figure 2: Block-Wise Activation Management Layout and Memory Usage Trend.
  • Figure 3: Maximum training throughput on four RTX 3090 GPUs (top) and A100 GPUs (bottom). A "$\times$" marks a configuration that failed to train because it ran out of memory.
  • Figure 4: Scalability of performance on RTX 3090 GPUs. (a) Maximum throughput across different numbers of GPUs. (b) Step time breakdown for different batch sizes.
  • Figure 5: Effectiveness of adaptive memory management on four RTX 3090 GPUs. (a) Runtime comparison of ProTrain with and without adaptive memory management. (b) Comparison of ProTrain's actual and predicted runtime across various configurations.
  • ...and 2 more figures