LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs

Taeho Kim, Yanming Wang, Vatshank Chaturvedi, Lokesh Gupta, Seyeon Kim, Yongin Kwon, Sangtae Ha

TL;DR

This work tackles GPU memory bottlenecks in fine-tuning large language models by proposing LLMem, a memory-estimation framework that predicts peak memory usage for single- and multi-GPU distributed fine-tuning and selects the fastest viable method. It extends previous work such as DNNMem with explicit modeling of transformer and lm_head memory, mixed precision, and multi-GPU parallelism (CDP, ADP, TP, DP+TP). The authors report high accuracy, with peak-memory error rates of up to 1.6% on a single GPU and an average of about 3.0% on multi-GPU setups for models with more than a billion parameters, while DNNMem underperforms in these scenarios. The approach yields a practical decision algorithm that balances memory constraints against fine-tuning speed, enabling faster, OOM-free tuning and indicating when CPU offloading may be necessary.

Abstract

Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPU. However, determining the most effective method for achieving rapid fine-tuning while preventing GPU out-of-memory issues in a given environment remains unclear. To address this challenge, we introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods across multiple GPUs and identifies the optimal method. We conduct GPU memory usage estimation prior to fine-tuning, leveraging the fundamental structure of transformer-based decoder models and the memory usage distribution of each method. Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.
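
To make the two ideas in the abstract concrete (measuring estimation error, and picking the fastest method that avoids out-of-memory), here is a minimal Python sketch. It is not the authors' implementation: the `Candidate` type, both function names, and every number in the usage lines are hypothetical and chosen only for illustration, and the error metric is assumed to be the absolute relative difference between estimated and measured peak memory, which this section does not spell out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One fine-tuning method with hypothetical, pre-computed estimates."""
    name: str
    est_peak_gib: float     # estimated peak GPU memory per device, in GiB
    est_step_time_s: float  # estimated time per fine-tuning step, in seconds


def relative_error_pct(estimated: float, measured: float) -> float:
    """Absolute relative error of an estimate, in percent (assumed metric)."""
    return abs(estimated - measured) / measured * 100.0


def pick_fastest_fitting(
    candidates: Sequence[Candidate], gpu_capacity_gib: float
) -> Optional[Candidate]:
    """Return the fastest candidate whose estimated peak fits on the GPU.

    Returns None when no candidate fits, which is the situation where
    something like CPU offloading would have to be considered.
    """
    viable = [c for c in candidates if c.est_peak_gib <= gpu_capacity_gib]
    if not viable:
        return None
    return min(viable, key=lambda c: c.est_step_time_s)


# Illustrative, made-up estimates for the four method families named above.
methods = [
    Candidate("CDP", est_peak_gib=19.5, est_step_time_s=1.0),
    Candidate("ADP", est_peak_gib=13.0, est_step_time_s=1.4),
    Candidate("TP", est_peak_gib=11.0, est_step_time_s=1.9),
    Candidate("DP+TP", est_peak_gib=12.0, est_step_time_s=1.6),
]

choice = pick_fastest_fitting(methods, gpu_capacity_gib=16.0)
print(choice.name if choice else "no method fits; consider CPU offloading")
```

The selection only trusts the memory estimate as a filter, so a method is ruled out before any fine-tuning run is attempted; speed is compared only among the methods that are predicted to fit.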

Paper Structure (18 sections, 9 equations, 9 figures, 3 tables, 1 algorithm)

Figures (9)

  • Figure 1: Peak GPU memory estimates per total parameter size on a single GPU
  • Figure 2: Illustration of tensors using GPU memory while fine-tuning the pre-trained model (Ren et al., 2021).
  • Figure 3: Basic structure of a transformer-based decoder model (Vaswani et al., 2017). As shown in Figure 2, the parameters in the transformer part are managed using chunk-based memory, while the lm_head part, responsible for deriving the output, consumes GPU memory based on its actual size.
  • Figure 4: Peak GPU memory computation for different distributed fine-tuning methods.
  • Figure 5: Advanced DP gathers the entire param fp16, while TP maintains the sharded param fp16 intact before entering the computation process.
  • ...and 4 more figures