ProTrain: Efficient LLM Training via Memory-Aware Techniques

Hanmei Yang; Jin Zhou; Yao Fu; Xiaoqun Wang; Ramine Roane; Hui Guan; Tongping Liu

TL;DR

ProTrain addresses the memory bottlenecks of large-language-model training by coordinating memory, computation, and IO through fine-grained chunk-based model state management and block-wise activation management, guided by a memory-aware runtime profiler. It leaves the training algorithm unchanged, so accuracy is preserved, while improving throughput by 1.43$\times$ to 2.71$\times$ over state-of-the-art systems and enabling training of models up to 70B parameters on a single A100. The approach combines an adaptive memory manager with a predictive runtime and memory estimator and a configuration search that is exhaustive but pruned, and it scales well across GPU counts and model sizes. Overall, ProTrain lowers the hardware barrier to large-scale LLM training, particularly in resource-constrained settings.

Abstract

Training Large Language Models (LLMs) is extremely memory-hungry. To address this, existing work such as ZeRO-Offload uses both CPU and GPU during training. Such techniques largely democratize billion-scale model training, making it possible to train with a few consumer graphics cards. However, we observe that existing frameworks often provide only coarse-grained memory management and require expert configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler, without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ compared to SOTA training systems.
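
To make the "memory-hungry" claim concrete, the following back-of-the-envelope sketch estimates the memory taken by model states alone under mixed-precision Adam training. It uses the commonly cited accounting of 16 bytes per parameter (fp16 parameters and gradients, plus fp32 master weights, momentum, and variance). The function names, model sizes, and GPU capacities are illustrative assumptions, not values from the paper, and activations and temporary buffers are deliberately ignored.

```python
# Rough model-state memory estimate for mixed-precision Adam training.
# Assumption: 16 bytes per parameter, the standard accounting for
# fp16 params + fp16 grads + fp32 optimizer states. Activations and
# temporary buffers are NOT included, so real usage is higher.

BYTES_FP16_PARAM = 2   # fp16 copy of each parameter
BYTES_FP16_GRAD = 2    # fp16 gradient of each parameter
BYTES_OPTIMIZER = 12   # fp32 master weight + momentum + variance (4 bytes each)

BYTES_PER_PARAM = BYTES_FP16_PARAM + BYTES_FP16_GRAD + BYTES_OPTIMIZER


def model_state_bytes(num_params: int) -> int:
    """Return the bytes needed to hold all model states for num_params parameters."""
    return num_params * BYTES_PER_PARAM


def fits_on_gpu(num_params: int, gpu_mem_gib: float) -> bool:
    """Return True if the model states alone fit in gpu_mem_gib GiB of memory."""
    return model_state_bytes(num_params) <= gpu_mem_gib * 2**30


if __name__ == "__main__":
    # Hypothetical model sizes and GPU capacities, chosen for illustration only.
    for label, num_params in [("1B", 1_000_000_000), ("7B", 7_000_000_000)]:
        gib = model_state_bytes(num_params) / 2**30
        print(f"{label}: {gib:.1f} GiB of model states, "
              f"fits in 24 GiB: {fits_on_gpu(num_params, 24)}, "
              f"fits in 80 GiB: {fits_on_gpu(num_params, 80)}")
```

Under this accounting, even a 7B-parameter model needs more memory for model states than a single 80 GiB device provides, which is why systems in this space offload or partition states between CPU and GPU.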

Paper Structure (43 sections, 3 equations, 7 figures, 3 tables)

Introduction
Background
Deep Learning Model Training
ZeRO Techniques
ProTrain Design
Chunk-Based Model State Management
Block-Wise Activation Management
Memory-Aware Runtime Profiling
Adaptive Memory Management
Chunk-Aware Runtime Estimator
Peak Memory Usage Estimator
Optimal Configuration Search
Experiments
Experimental Setup
Workloads
...and 28 more sections

Figures (7)

Figure 1: Key Chunk Operations in Chunk-Based Model State Management.
Figure 2: Block-Wise Activation Management Layout and Memory Usage Trend.
Figure 3: Maximum Training Throughput on four RTX 3090 GPUs (upper) and A100 GPUs (bottom). The notation "$\times$" indicates failure to train due to out of memory.
Figure 4: Scalability of performance on RTX 3090 GPUs. (a) Maximum throughput across different numbers of GPUs. (b) Step time breakdown for different batch sizes.
Figure 5: Effectiveness of adaptive memory management on four RTX 3090 GPUs. (a) Runtime comparison of ProTrain w/ and w/o adaptive memory management. (b) Comparison of ProTrain's actual and predicted runtime across various configurations.
...and 2 more figures
