Practical offloading for fine-tuning LLM on commodity GPU via learned sparse projectors

Siyuan Chen, Zhuofeng Wang, Zelong Guan, Yudong Liu, Phillip B. Gibbons

TL;DR

This work tackles the memory and bandwidth bottlenecks of fine-tuning large language models on commodity GPUs with LSP-Offload, a subspace-based offloading framework built on learned $(d,r)$-sparse projectors. By projecting gradients and updates into large yet memory-efficient subspaces, and by overlapping CPU and GPU computation with communication in a layer-wise schedule, the approach achieves near-native fine-tuning speed on laptop and workstation GPUs while converging to the same accuracy as native fine-tuning. The authors provide a theoretical convergence analysis, a detailed implementation, and an empirical evaluation on GLUE and instruction-tuning tasks that shows substantial speedups over ZeRO-Offload and improvements over PEFT baselines. The framework is open source and comes with guidelines for choosing hyperparameters that balance memory, estimation bias, and convergence. Overall, LSP-Offload shows that learned sparse subspaces, coupled with fine-grained scheduling, make fine-tuning sizable LLMs practical on modest hardware.

Abstract

Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. However, this approach is hampered by the limited bandwidth of commodity hardware, which constrains communication between the CPU and GPU, and by slower matrix multiplications on the CPU. In this paper, we present an offloading framework, LSP-Offload, that enables near-native speed LLM fine-tuning on commodity hardware through learned sparse projectors. Our data-driven approach involves learning efficient sparse compressors that minimize communication with minimal precision loss. Additionally, we introduce a novel layer-wise communication schedule to maximize parallelism between communication and computation. As a result, our framework can fine-tune a 1.3 billion parameter model on a 4GB laptop GPU and a 6.7 billion parameter model on a 24GB NVIDIA RTX 4090 GPU. Compared to state-of-the-art offloading frameworks, our approach reduces end-to-end fine-tuning time by 33.1%-62.5% when converging to the same accuracy. We open source our framework at https://github.com/gulang2019/LSP-Offload.
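To make the compression idea in the abstract concrete, here is a minimal sketch in plain Python. It rests on several assumptions that this page does not spell out: that a $(d,r)$-sparse projector is a matrix with $d$ columns and exactly $r$ nonzeros per row, that a gradient $G \in \mathbb{R}^{m \times n}$ is compressed as $S = P^\top G Q$ and decompressed as $P S Q^\top$, and that random sign entries are an acceptable stand-in for the learned projectors. The function names (`make_sparse_projector`, `compress`, `decompress`) are hypothetical and are not taken from the LSP-Offload code base.

```python
import random


def make_sparse_projector(rows, d, r, rng):
    """Build a rows x d matrix with exactly r nonzeros per row.

    Each row is stored sparsely as a list of (column, value) pairs.
    ASSUMPTION: random +/- 1/sqrt(r) entries replace the learned values.
    """
    scale = 1.0 / (r ** 0.5)
    return [
        [(col, rng.choice((-scale, scale))) for col in rng.sample(range(d), r)]
        for _ in range(rows)
    ]


def compress(P, G, Q, d):
    """Return S = P^T G Q, a d x d matrix (assumed compression form)."""
    S = [[0.0] * d for _ in range(d)]
    for i, p_row in enumerate(P):
        for j, q_row in enumerate(Q):
            g = G[i][j]
            # Each gradient entry touches only r * r entries of S.
            for a, pv in p_row:
                for b, qv in q_row:
                    S[a][b] += pv * g * qv
    return S


def decompress(P, S, Q):
    """Return P S Q^T, an m x n matrix in the original gradient shape."""
    return [
        [sum(pv * S[a][b] * qv for a, pv in p_row for b, qv in q_row)
         for q_row in Q]
        for p_row in P
    ]


rng = random.Random(0)
m, n, d, r = 8, 6, 4, 2  # toy sizes chosen for illustration only
P = make_sparse_projector(m, d, r, rng)
Q = make_sparse_projector(n, d, r, rng)
G = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

S = compress(P, G, Q, d)      # what would be sent from GPU to CPU
G_hat = decompress(P, S, Q)   # update mapped back to the full shape

print(f"values transferred uncompressed: {m * n}, compressed: {d * d}")
```

Under these assumptions, only the $d \times d$ matrix `S` crosses the CPU-GPU link instead of the full $m \times n$ gradient, and the sparsity keeps the projection cost at roughly $r^2$ multiply-adds per gradient entry. How well `G_hat` approximates `G` depends on the projectors, which is presumably why the paper learns them from data rather than sampling them at random as this sketch does.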

Paper Structure

This paper contains 42 sections, 4 theorems, 17 equations, 9 figures, 5 tables, 3 algorithms.

Key Result

Theorem 1

For any $\beta > 0$, $0 < \delta < 1$, suppose that $f$ is an $L$-smooth function, Assumptions asp:effOfSubspace, asp:boundedBias, asp:sparseBias hold and that we check every iteration in Alg. alg:SGESchedule with the subsampled data set $\mathcal{D}'$ of size $\mathcal{O}(\frac{8\gamma^2}{3\beta^2}\log{\frac{

Figures (9)

  • Figure 1: LSP-Offload
  • Figure 2: Normalized slowdown of Zero's schedule on laptop and workstation GPUs. The breakdown for communication (Comm) depicts the additional slowdown due to communication that is not overlapped with GPU compute. Similarly, the CPU compute and Other are additional non-overlapped overheads. The experiments are done using precision fp16, the largest batch sizes (BS) that fit, and gradient checkpointing.
  • Figure 3: Comparison between current offloading pipelines and LSP-Offload's overlapped pipeline.
  • Figure 4: Visualization on Optimization Space.
  • Figure 5: End-to-end evaluation of LSP-Offload. Rolling average is applied. Shades are for deviation.
  • ...and 4 more figures

Theorems & Definitions (10)

  • Definition 1: $(d,r)$-Sparse Projector
  • Definition 2: Estimation Bias
  • Theorem 1
  • Remark 1
  • Remark 2
  • Lemma 1: Matrix Chernoff
  • Lemma 2
  • Proof
  • Theorem 2
  • Proof