LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits
Zikai Zhou, Qizheng Zhang, Hermann Kumbong, Kunle Olukotun
TL;DR
This work tackles the rising cost of fine-tuning very large language models by enabling LoRA-based fine-tuning at ultra-low-bit quantization levels. It proposes LowRA, an end-to-end framework that combines a per-output-channel mapping and threshold learner, a two-stage ILP-based mixed-precision quantizer, CUDA-accelerated kernels, and LoftQ-based low-rank initialization, so that only the adapters are trained while the base weights stay quantized. Key contributions include identifying three quantization limitations, designing a fine-grained, data-free quantization pipeline, and showing empirically that LowRA achieves a superior performance-precision trade-off above $2$ bits and remains accurate down to $1.15$ bits per parameter, with memory reductions of up to $50\%$. The practical impact is to make large models easier to deploy on-device and in resource-constrained settings and to broaden access to LLM fine-tuning, supported by an open-source framework for further research and application.
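To make the idea of per-output-channel mappings and thresholds concrete, here is a minimal sketch in plain Python. It is not the LowRA implementation: the function names, the threshold and level values, and the tiny LoRA forward pass are all assumptions chosen for illustration. Each output channel carries its own sorted thresholds (which bin a weight falls into) and its own mapping (which value each bin decodes to); the frozen base weights are stored only as bin indices, and a trainable low-rank term is added on top.

```python
from bisect import bisect_right

def quantize_channel(weights, thresholds):
    """Encode each weight as the index of the bin it falls into (thresholds sorted ascending)."""
    return [bisect_right(thresholds, w) for w in weights]

def dequantize_channel(codes, mapping):
    """Decode bin indices back to the representative value of each bin."""
    return [mapping[c] for c in codes]

def lora_forward(x, quantized_rows, A, B):
    """Compute y = dequant(W_q) @ x + B @ (A @ x).

    quantized_rows holds one (codes, mapping) pair per output channel, so each
    channel may use its own levels. A (rank x in) and B (out x rank) are the adapters.
    """
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    y = []
    for (codes, mapping), b_row in zip(quantized_rows, B):
        base = sum(mapping[c] * xi for c, xi in zip(codes, x))   # frozen, quantized path
        delta = sum(b * h for b, h in zip(b_row, Ax))            # trainable low-rank path
        y.append(base + delta)
    return y

# Hypothetical 2-bit channel: 3 thresholds define 4 bins, 4 mapping values decode them.
thresholds = [-0.5, 0.0, 0.5]
mapping = [-0.75, -0.25, 0.25, 0.75]
weights = [-0.9, -0.1, 0.2, 0.7]

codes = quantize_channel(weights, thresholds)
restored = dequantize_channel(codes, mapping)

# One output channel, rank-1 adapters (illustrative values only).
x = [1.0, 1.0, 1.0, 1.0]
A = [[1.0, 0.0, 0.0, 0.0]]
B = [[0.5]]
y = lora_forward(x, [(codes, mapping)], A, B)
```

In this sketch only `A` and `B` would receive gradients during fine-tuning; `codes`, `thresholds`, and `mapping` stay fixed. How LowRA actually learns the thresholds and mappings is not shown here.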
Abstract
Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization (mapping, threshold selection, and precision assignment) while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across four LLMs and four datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.
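A fractional figure such as 1.15 bits per parameter arises when different parts of the model are assigned different precisions and the result is averaged over all parameters. The sketch below illustrates that arithmetic only. The 85/15 split between 1-bit and 2-bit channels, the channel size, and the function names are assumptions for illustration, not the assignment LowRA produces, and the storage estimate ignores per-channel metadata such as thresholds and mappings.

```python
def average_bits(channel_bits, channel_sizes):
    """Average bits per parameter for a per-channel mixed-precision assignment."""
    total_bits = sum(b * n for b, n in zip(channel_bits, channel_sizes))
    total_params = sum(channel_sizes)
    return total_bits / total_params

def packed_bytes(channel_bits, channel_sizes):
    """Storage for the quantized codes alone, in bytes (metadata not counted)."""
    total_bits = sum(b * n for b, n in zip(channel_bits, channel_sizes))
    return total_bits / 8

# Hypothetical layer: 100 output channels of 4096 weights each,
# 85 channels stored at 1 bit and 15 channels stored at 2 bits.
channel_sizes = [4096] * 100
channel_bits = [1] * 85 + [2] * 15

avg = average_bits(channel_bits, channel_sizes)
size = packed_bytes(channel_bits, channel_sizes)
print(f"average bits per parameter: {avg:.2f}")
print(f"packed code size: {size:.0f} bytes")
```

A precision-assignment step can be viewed as choosing `channel_bits` so that `average_bits` meets a target budget while keeping quantization error low; the sketch covers only the budget side of that trade-off.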
