LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

Zikai Zhou, Qizheng Zhang, Hermann Kumbong, Kunle Olukotun

TL;DR

This work tackles the rising cost of fine-tuning very large language models by enabling LoRA-based fine-tuning at ultra-low-bit quantization levels. It proposes LowRA, an end-to-end framework that combines a per-output-channel mapping/threshold learner, a two-stage ILP-based mixed-precision quantizer, and CUDA-accelerated kernels with LoftQ-based low-rank initialization to train only adapters while keeping base weights quantized. Key contributions include the identification of three quantization limitations, the design of a fine-grained, data-free quantization pipeline, and an evidence-based demonstration that LowRA achieves a superior performance-precision trade-off above $2$ bits and remains accurate down to $1.15$ bits per parameter, with memory reductions up to $50\%$. The practical impact is substantial: enabling on-device and resource-constrained deployment of large models and broad democratization of LLM fine-tuning, supported by an open-source framework for further research and application.

Abstract

Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization - mapping, threshold selection, and precision assignment - while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.
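
A fractional figure such as 1.15 bits per parameter can be read as an average over weights that are stored at different integer precisions. The sketch below shows only that arithmetic. The channel count, channel size, and the 85/15 split between 1-bit and 2-bit channels are hypothetical values chosen for illustration, not numbers from the paper, and `average_bits` is an illustrative helper, not part of LowRA.

```python
def average_bits(channel_bits, channel_sizes):
    """Return the mean number of bits per parameter across all channels."""
    total_bits = sum(b * n for b, n in zip(channel_bits, channel_sizes))
    total_params = sum(channel_sizes)
    return total_bits / total_params

# Hypothetical layer: 100 output channels of 4096 weights each,
# 85 channels stored at 1 bit and 15 channels stored at 2 bits.
channel_bits = [1] * 85 + [2] * 15
channel_sizes = [4096] * 100

avg = average_bits(channel_bits, channel_sizes)
print(f"average bits per parameter: {avg:.2f}")
```

This count ignores any per-channel metadata (for example, stored mappings or scales), which a real memory estimate would need to include.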

Paper Structure

This paper contains 53 sections, 5 equations, 10 figures, 12 tables, 1 algorithm.

Figures (10)

  • Figure 1: End-to-end workflow of LowRA.
  • Figure 2: Distributions of normalized parameters in different output channels sampled from the first layer of Llama2-7b.
  • Figure 3: Roles of mappings and thresholds in quantization. Circles represent thresholds, whereas crosses represent mappings. Colored triangles represent the process of converting a range of original (unquantized) real values, partitioned by thresholds, to the mapped value corresponding to each quantization level.
  • Figure 4: Two-step ILP-based Workflow for Channelwise Precision Assignment
  • Figure 5: Overview of Kernel for Low-Bit Fine-Grained Quantization. Indexing logic omitted for simplicity.
  • ...and 5 more figures
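
The roles of thresholds and mappings described in the Figure 3 caption can be sketched in a few lines: thresholds partition the real line into bins, and each bin is represented by one mapped value. This is a minimal illustration under assumed values, not the paper's implementation. The function name `quantize`, the threshold and mapping values, and the sample channel are all hypothetical.

```python
from bisect import bisect_right

def quantize(values, thresholds, mappings):
    """Assign each value to a bin delimited by sorted thresholds,
    then replace it with that bin's mapped value."""
    # k thresholds split the real line into k + 1 bins, one mapping per bin.
    assert len(mappings) == len(thresholds) + 1
    codes = [bisect_right(thresholds, v) for v in values]  # integer level per value
    dequantized = [mappings[c] for c in codes]             # representative value per level
    return codes, dequantized

# Hypothetical 2-bit setup for one output channel: 3 thresholds, 4 mapped values.
thresholds = [-0.5, 0.0, 0.5]
mappings = [-0.75, -0.25, 0.25, 0.75]
channel = [-0.9, -0.1, 0.2, 0.8]  # normalized weights of one channel (made up)

codes, deq = quantize(channel, thresholds, mappings)
print(codes)
print(deq)
```

In a per-output-channel scheme, each channel would carry its own `thresholds` and `mappings` lists, so the same call would be repeated with different parameters per channel.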