LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs

Taeho Kim; Yanming Wang; Vatshank Chaturvedi; Lokesh Gupta; Seyeon Kim; Yongin Kwon; Sangtae Ha

TL;DR

This work tackles GPU memory bottlenecks in fine-tuning large language models by proposing LLMem, a memory-estimation framework that predicts peak memory usage for single-GPU and multi-GPU distributed fine-tuning and selects the fastest method that fits in memory. It extends previous work such as DNNMem with explicit modeling of transformer and lm_head memory, mixed precision, and multi-GPU parallelism (CDP, ADP, TP, DP+TP). The authors report peak-memory error rates of up to $1.6\%$ on a single GPU and an average of $3.0\%$ on multi-GPU setups for models with more than a billion parameters, while DNNMem underperforms in these scenarios. The approach yields a practical decision algorithm that balances memory constraints against fine-tuning speed, enabling faster fine-tuning without out-of-memory (OOM) failures and indicating when CPU offloading may be necessary.
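The decision step described in this summary can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `pick_method`, the per-method estimates, and the speed ranking are all hypothetical, and the sketch assumes the caller already has a peak-memory estimate for each method and a fastest-first ordering.

```python
from typing import Dict, List, Optional


def pick_method(
    estimates_gib: Dict[str, float],
    capacity_gib: float,
    speed_order: List[str],
) -> Optional[str]:
    """Return the first method in speed_order whose estimated peak fits.

    estimates_gib: estimated peak GPU memory per method (hypothetical values).
    capacity_gib:  usable memory per GPU.
    speed_order:   methods ranked fastest first (an assumption of this sketch).
    """
    for method in speed_order:
        peak = estimates_gib.get(method)
        if peak is not None and peak <= capacity_gib:
            return method
    # No method fits: this is the case where CPU offloading may be needed.
    return None


# Illustrative numbers only; they are not taken from the paper.
estimates = {"CDP": 19.5, "ADP": 14.2, "TP": 12.8, "DP+TP": 13.5}
order = ["CDP", "ADP", "DP+TP", "TP"]  # assumed ranking, fastest first
choice = pick_method(estimates, capacity_gib=16.0, speed_order=order)
print(choice)
```

With these made-up inputs, CDP is skipped because its estimate exceeds the 16 GiB budget, and the next method in the assumed ranking that fits is returned. A return value of `None` signals that no listed method fits on the GPU.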

Abstract

Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPUs. However, it remains unclear which method achieves the fastest fine-tuning while preventing GPU out-of-memory issues in a given environment. To address this challenge, we introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods across multiple GPUs and identifies the optimal method. We estimate GPU memory usage prior to fine-tuning, leveraging the fundamental structure of transformer-based decoder models and the memory usage distribution of each method. Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.
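To make the reported error rates concrete, the sketch below computes a relative error between an estimated and a measured peak. It assumes the error rate is the absolute difference divided by the measured peak, which this page does not state explicitly; the function name and the numbers are hypothetical.

```python
def relative_error_pct(estimated_gib: float, measured_gib: float) -> float:
    """Relative estimation error in percent, measured peak as the reference.

    Assumed definition: |estimated - measured| / measured * 100.
    """
    if measured_gib <= 0:
        raise ValueError("measured peak must be positive")
    return abs(estimated_gib - measured_gib) / measured_gib * 100.0


# Illustrative numbers only; they are not results from the paper.
err = relative_error_pct(estimated_gib=12.3, measured_gib=12.5)
print(f"{err:.1f}%")
```

In practice the measured peak would come from the training run itself, for example from the framework's peak-memory counter, while the estimate is produced before fine-tuning starts.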

Paper Structure (18 sections, 9 equations, 9 figures, 3 tables, 1 algorithm)

Introduction
Related Works
GPU Memory Estimation
Distributed Fine-Tuning with GPUs
Motivation
DNNMem Implementation
Limitations of DNNMem for LLM Fine-Tuning Memory Estimation
Single-GPU Memory Usage Estimation
Workflow for Fine-Tuning Pre-Trained Models
Memory Consumption with Structure of Transformer-based Decoder Model
Multi-GPU Memory Usage Estimation
Distributed Fine-Tuning Method Decision
Experiments
Experimental Setup
Estimation of Single-GPU Memory Usage
...and 3 more sections

Figures (9)

Figure 1: Peak GPU memory estimates per total parameter size on a single GPU
Figure 2: Illustration of tensors using GPU memory while fine-tuning the pre-trained model (Ren et al., 2021)
Figure 3: Basic structure of a transformer-based decoder model (Vaswani et al., 2017). As shown in Figure 2, the parameters in the transformer part are managed using chunk-based memory, while the lm_head part, responsible for deriving the output, consumes GPU memory based on its actual size.
Figure 4: Peak GPU memory computation for different distributed fine-tuning methods.
Figure 5: Advanced DP gathers the entire param fp16, while TP maintains the sharded param fp16 intact before entering the computation process.
...and 4 more figures
