LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs
Taeho Kim, Yanming Wang, Vatshank Chaturvedi, Lokesh Gupta, Seyeon Kim, Yongin Kwon, Sangtae Ha
TL;DR
This work tackles GPU memory bottlenecks in fine-tuning large language models by proposing LLMem, a memory-estimation framework that predicts peak GPU memory usage for single-GPU and multi-GPU distributed fine-tuning and selects the fastest method that fits in memory. It extends previous work such as DNNMem with explicit modeling of transformer and lm_head memory, mixed precision, and multi-GPU parallelism (CDP, ADP, TP, DP+TP). The authors report that LLMem estimates peak memory with error rates of up to 1.6% on a single GPU and an average error rate of 3.0% on multi-GPU setups for models with more than a billion parameters, while DNNMem underperforms in these scenarios. The approach yields a practical decision algorithm that balances memory constraints against fine-tuning speed, enabling faster, OOM-free fine-tuning and indicating when CPU offloading may be necessary.
Abstract
Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPUs. However, it remains unclear which method achieves the fastest fine-tuning while preventing GPU out-of-memory issues in a given environment. To address this challenge, we introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods across multiple GPUs and identifies the optimal method. We conduct GPU memory usage estimation prior to fine-tuning, leveraging the fundamental structure of transformer-based decoder models and the memory usage distribution of each method. Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.
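
The selection step described above (estimate peak memory for each method, then pick the fastest one that fits) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Candidate` fields, the `select_method` and `relative_error_pct` helpers, and the idea of supplying a per-method step-time estimate are all assumptions made for illustration, and any numbers passed in are placeholders rather than measurements from the paper. The error metric shown is the usual relative error, which is assumed here to match what the reported error rates mean.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One fine-tuning method with hypothetical, pre-computed estimates."""
    name: str                # e.g. "CDP", "ADP", "TP", "DP+TP"
    est_peak_gib: float      # estimated peak memory per GPU (assumed input)
    est_step_time_s: float   # estimated time per training step (assumed input)


def select_method(candidates: Sequence[Candidate],
                  gpu_capacity_gib: float) -> Optional[Candidate]:
    """Return the fastest candidate whose estimated peak memory fits.

    Returns None when no candidate fits, which is the case where
    CPU offloading (or more GPUs) would have to be considered.
    """
    feasible = [c for c in candidates if c.est_peak_gib <= gpu_capacity_gib]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.est_step_time_s)


def relative_error_pct(estimated: float, measured: float) -> float:
    """Relative estimation error in percent (assumed definition)."""
    return abs(estimated - measured) / measured * 100.0


if __name__ == "__main__":
    # Placeholder values, not results from the paper.
    options = [
        Candidate("CDP", est_peak_gib=20.0, est_step_time_s=1.0),
        Candidate("ADP", est_peak_gib=12.0, est_step_time_s=1.5),
        Candidate("TP", est_peak_gib=14.0, est_step_time_s=2.0),
    ]
    choice = select_method(options, gpu_capacity_gib=16.0)
    print(choice.name if choice else "no method fits; consider CPU offloading")
```

The sketch treats memory as a hard constraint and speed as the objective, so an accurate peak-memory estimate matters most near the capacity boundary: an underestimate there leads to an out-of-memory failure, while an overestimate can rule out a faster method.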
