LoHan: Low-Cost High-Performance Framework to Fine-Tune 100B Model on a Consumer GPU

Changyue Liao, Mo Sun, Zihan Yang, Jun Xie, Kaiqi Chen, Binhang Yuan, Fei Wu, Zeke Wang

TL;DR

LoHan tackles the challenge of fine-tuning 100B-scale LLMs on a single consumer-grade GPU in a server with limited main memory by managing tensor movement holistically. It combines active gradient offloading, which overlaps CPU optimizer work with GPU backward computation and SSD I/O, with a holistic activation swapping strategy that minimizes iteration time through convex optimization of the swapped activations. LoHan is the first to fine-tune a 175B model on an RTX 4090 with 256 GB of main memory, achieves 2.32x the throughput of state-of-the-art baselines on a 13B model, and makes a consumer GPU more cost-effective than a DGX-A100 cluster for fine-tuning a 175B model. The work also reports applicability across model scales and to diffusion models, which makes it relevant to researchers with modest hardware budgets.
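To make the convex-optimization idea concrete, here is a minimal sketch in Python. It is not LoHan's actual cost model: the functions `recompute_time` and `swap_time`, their parameters, and all numbers are hypothetical and chosen only for illustration. The sketch assumes that iteration time is a sum of convex terms in the swapped activation volume, so a simple ternary search is enough to find the minimum.

```python
def ternary_search_min(f, lo, hi, iters=200):
    """Approximate the minimizer of a convex function f on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) <= f(m2):
            hi = m2  # for convex f, a minimizer lies in [lo, m2]
        else:
            lo = m1  # otherwise a minimizer lies in [m1, hi]
    return (lo + hi) / 2


# Hypothetical cost terms (seconds) as functions of the swapped volume x (GB).
def recompute_time(x, total_gb=40.0, sec_per_gb=0.05):
    # Activations that are not swapped are recomputed: affine, hence convex.
    return sec_per_gb * (total_gb - x)


def swap_time(x, bandwidth_gb_s=10.0, hidden_sec=2.0):
    # Swap traffic is assumed hidden behind compute for up to hidden_sec;
    # beyond that it stalls the iteration. A max of affine terms is convex.
    return max(0.0, x / bandwidth_gb_s - hidden_sec)


def iteration_time(x):
    # A sum of convex terms is convex, so ternary search applies.
    return recompute_time(x) + swap_time(x)


best = ternary_search_min(iteration_time, 0.0, 40.0)
print(f"swap about {best:.2f} GB, iteration overhead {iteration_time(best):.2f} s")
```

With these made-up numbers, the overhead falls while swapping stays hidden behind compute and rises once the swap traffic starts to stall the iteration, so the search settles on the break point between the two regimes.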

Abstract

AI researchers are increasingly interested in fine-tuning pre-trained LLMs, whose sizes have grown to over 100B parameters, for their downstream tasks. One approach to fine-tuning such huge models is to aggregate device memory from many GPUs. However, this approach is prohibitively expensive for most data scientists, who have limited budgets for high-end GPU servers. In this paper, we focus on LLM fine-tuning on a single consumer-grade GPU in a commodity server with limited main memory capacity, a setup that is accessible to most AI researchers. In such a scenario, existing offloading-based methods fail to fine-tune an LLM efficiently due to a lack of holistic intra-server tensor movement management. To this end, we present LoHan, a low-cost, high-performance deep learning training framework that enables efficient 100B-scale model fine-tuning on a commodity server with a consumer-grade GPU and limited main memory capacity. The key idea is to add holistic offloading traffic as an optimization dimension for 1) active gradient offloading and 2) a holistic traffic-aware activation swapping mechanism. The experimental results show that 1) LoHan is the first to fine-tune a 175B model on an RTX 4090 with 256 GB of main memory, 2) LoHan achieves 2.32x the throughput of the state-of-the-art baselines when fine-tuning a small 13B model, and 3) LoHan enables a cheap low-end consumer GPU to have higher cost-effectiveness than a DGX-A100 cluster when fine-tuning a 175B model.

Paper Structure (24 sections, 4 theorems, 8 equations, 13 figures, 7 tables, 1 algorithm)

Key Result

Theorem 1

The sum of convex functions is convex.
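
For reference, the standard one-line argument behind this statement is sketched below. It is the textbook proof for two functions, not necessarily the proof given in the paper; the symbols $f$, $g$, $x$, $y$, and $\lambda$ are introduced here for illustration. The general case of finitely many functions follows by induction.

```latex
Let $f$ and $g$ be convex on a convex set $D$. For any $x, y \in D$ and
$\lambda \in [0, 1]$:
\begin{align*}
(f+g)\bigl(\lambda x + (1-\lambda) y\bigr)
  &= f\bigl(\lambda x + (1-\lambda) y\bigr) + g\bigl(\lambda x + (1-\lambda) y\bigr) \\
  &\le \lambda f(x) + (1-\lambda) f(y) + \lambda g(x) + (1-\lambda) g(y) \\
  &= \lambda (f+g)(x) + (1-\lambda) (f+g)(y).
\end{align*}
Hence $f + g$ is convex on $D$.
```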

Figures (13)

  • Figure 1: Comparison of offloading-based systems. Bandwidth and overhead numbers are obtained on our evaluation server with 12 SSDs when fine-tuning a 13B model with a batch size of 32.
  • Figure 2: Issues of SSD-offloading methods that motivate the design of LoHan. We perform the experiments on RTX 4090.
  • Figure 3: Comparison of active gradient offloading designs.
  • Figure 4: User interface comparison of PyTorch and LoHan.
  • Figure 5: End-to-end GPU throughput comparison between LoHan and baselines with different batch sizes.
  • ...and 8 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4