Practical offloading for fine-tuning LLM on commodity GPU via learned sparse projectors
Siyuan Chen, Zhuofeng Wang, Zelong Guan, Yudong Liu, Phillip B. Gibbons
TL;DR
This work tackles the memory and bandwidth bottlenecks of fine-tuning large language models on commodity GPUs by proposing LSP-Offload, a subspace-based offloading framework built on learned $(d,r)$-sparse projectors. By projecting gradients and updates into large, memory-efficient subspaces and overlapping CPU-GPU computation with communication in a layer-wise schedule, the approach achieves near-native fine-tuning speed on laptop GPUs and high-end desktops while preserving convergence to native accuracy. The authors provide a theoretical convergence analysis, a detailed implementation, and extensive empirical evaluation showing substantial speedups over Zero-Offload and improvements over PEFT baselines across GLUE and instruction-tuning tasks. The methodology enables practical fine-tuning of sizable models on modest hardware (a 1.3-billion-parameter model on a 4GB laptop GPU and a 6.7-billion-parameter model on a 24GB NVIDIA RTX 4090), and it comes with an open-source release and guidelines for choosing hyperparameters that balance memory, bias, and convergence. Overall, LSP-Offload demonstrates that learned sparse subspaces, coupled with fine-grained scheduling, can substantially expand the accessibility and efficiency of LLM fine-tuning on commodity hardware.
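To make the subspace idea concrete, the following is a minimal sketch of compressing a gradient matrix with a pair of sparse projectors before it crosses the CPU-GPU link. It rests on assumptions that this summary does not spell out: that a $(d,r)$-sparse projector pair consists of matrices $P \in \mathbb{R}^{m \times d}$ and $Q \in \mathbb{R}^{n \times d}$ with $r$ nonzeros per row, that the compressed gradient is $S = P^\top G Q$, and that an update is mapped back as $P S Q^\top$. The projector values here are random signs rather than learned, and all function names are hypothetical; none of this is taken from the released code.

```python
import random


def make_sparse_projector(rows, d, r, rng):
    """Build a rows x d projector with exactly r nonzeros per row.

    Each row is stored sparsely as a list of (column, value) pairs.
    Values are random signs scaled by 1/sqrt(r); a learned projector
    would fit these values to data instead (assumption for this sketch).
    """
    scale = 1.0 / (r ** 0.5)
    return [
        [(col, rng.choice((-scale, scale))) for col in rng.sample(range(d), r)]
        for _ in range(rows)
    ]


def compress(grad, P, Q, d):
    """Return the d x d matrix S = P^T G Q for a dense m x n gradient G."""
    S = [[0.0] * d for _ in range(d)]
    for i, grad_row in enumerate(grad):
        for j, g in enumerate(grad_row):
            for a, p_val in P[i]:
                for b, q_val in Q[j]:
                    S[a][b] += p_val * g * q_val
    return S


def decompress(S, P, Q):
    """Return the m x n matrix P S Q^T, mapping a subspace update back."""
    return [
        [sum(p_val * S[a][b] * q_val for a, p_val in p_row for b, q_val in q_row)
         for q_row in Q]
        for p_row in P
    ]


def transfer_ratio(m, n, d):
    """Fraction of values sent when shipping S (d x d) instead of G (m x n)."""
    return (d * d) / (m * n)


if __name__ == "__main__":
    rng = random.Random(0)
    m, n, d, r = 8, 6, 4, 2  # toy sizes, chosen only for illustration
    P = make_sparse_projector(m, d, r, rng)
    Q = make_sparse_projector(n, d, r, rng)
    grad = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
    S = compress(grad, P, Q, d)
    update = decompress(S, P, Q)
    print(len(S), len(S[0]), len(update), len(update[0]), transfer_ratio(m, n, d))
```

Under these assumptions, each gradient entry touches only $r^2$ entries of $S$, so the projection costs on the order of $r^2 mn$ multiply-adds, and the amount of data sent per weight matrix drops from $mn$ values to $d^2$.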
Abstract
Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. However, this approach is hampered by the limited bandwidth of commodity hardware, which constrains communication between the CPU and GPU, and by slower matrix multiplications on the CPU. In this paper, we present an offloading framework, LSP-Offload, that enables LLM fine-tuning at near-native speed on commodity hardware through learned sparse projectors. Our data-driven approach involves learning efficient sparse compressors that minimize communication with minimal precision loss. Additionally, we introduce a novel layer-wise communication schedule to maximize parallelism between communication and computation. As a result, our framework can fine-tune a 1.3-billion-parameter model on a 4GB laptop GPU and a 6.7-billion-parameter model on a 24GB NVIDIA RTX 4090 GPU. Compared to state-of-the-art offloading frameworks, our approach reduces end-to-end fine-tuning time by 33.1%-62.5% when converging to the same accuracy. We open-source our framework at https://github.com/gulang2019/LSP-Offload.
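The benefit of overlapping communication with computation can be illustrated with a toy timing model. The sketch below is not the schedule used by LSP-Offload, whose details are not given here; it assumes a simple two-stage pipeline in which the GPU computes one layer at a time, the CPU-GPU link transfers one layer's data at a time, and a layer's transfer can begin as soon as its computation finishes. The function names and the per-layer costs are hypothetical.

```python
def sequential_time(compute, transfer):
    """Total time when each layer's transfer blocks the next layer's compute."""
    return sum(c + t for c, t in zip(compute, transfer))


def overlapped_time(compute, transfer):
    """Total time when transfers run concurrently with later layers' compute.

    Models a two-stage pipeline: compute is serialized on the GPU,
    transfers are serialized on the link, and layer i's transfer starts
    once both its compute and the previous transfer have finished.
    """
    compute_done = 0
    transfer_done = 0
    for c, t in zip(compute, transfer):
        compute_done += c
        transfer_done = max(compute_done, transfer_done) + t
    return transfer_done


if __name__ == "__main__":
    # Hypothetical per-layer costs in arbitrary time units.
    compute = [2, 2, 2]
    transfer = [3, 3, 3]
    print(sequential_time(compute, transfer))  # 15
    print(overlapped_time(compute, transfer))  # 11
```

In this model the overlapped time is bounded below by the slower of the two stages, which is why shrinking the transferred data and overlapping at layer granularity are complementary: compression reduces the transfer stage, and pipelining hides what remains behind computation.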
