RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy

Geonho Lee, Janghwan Lee, Sukjin Hong, Minsoo Kim, Euijai Ahn, Du-Seong Chang, Jungwook Choi

TL;DR

RILQ introduces Rank-Insensitive LoRA-based Quantization Error Compensation to address the persistent accuracy gap in 2-bit LLM quantization. By adopting a model-level discrepancy loss and GT-Loss, it enables cooperative, low-rank adapter updates that effectively compensate quantization errors across Transformer layers, restoring performance with minimal computational overhead. Across LLaMA-2 and LLaMA-3, RILQ yields consistent 2-bit accuracy gains with multiple state-of-the-art quantizers and supports adapter-merged weight-quantized inference without extra inference cost. The approach demonstrates strong rank-insensitive behavior, scalability to large models, and practical training-time efficiency, making 2-bit LLM deployment more viable; code is available at the referenced GitHub repository.

Abstract

Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning, with LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, and no prior work has investigated the cause of this limitation. We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand this fundamental limitation and boost 2-bit LLM accuracy. Our rank analysis reveals that a model-wise activation discrepancy loss is rank-insensitive, so RILQ employs this loss to adjust adapters cooperatively across layers, enabling robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and LLaMA-3 demonstrate RILQ's consistent improvements in 2-bit quantized inference across various state-of-the-art quantizers and enhanced accuracy in task-specific fine-tuning. RILQ maintains computational efficiency comparable to existing LoRA methods and enables adapter-merged weight-quantized LLM inference with significantly enhanced accuracy, making it a promising approach for boosting 2-bit LLM performance. Our code is available at https://github.com/aiha-lab/RILQ.
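
To make the idea of a model-wise activation discrepancy loss concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the toy per-tensor uniform quantizer, the two-layer tanh stack, and the names `fake_quantize`, `svd_adapters`, `forward`, and `model_level_loss` are all assumptions chosen for illustration. The sketch only evaluates the loss at the final output of the stack. A method such as RILQ would minimize a loss of this kind by gradient descent over all adapters jointly, which is omitted here.

```python
import numpy as np

def fake_quantize(w, bits):
    """Toy per-tensor uniform quantizer (an assumption, not one of the paper's quantizers)."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels
    return np.round((w - lo) / scale) * scale + lo

def svd_adapters(w, q, rank):
    """Rank-`rank` factors (A, B) whose product best approximates the weight error W - Q."""
    u, s, vt = np.linalg.svd(w - q, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

def forward(x, weights, adapters=None):
    """Toy stack of linear layers with tanh; each layer optionally adds a low-rank term A @ B."""
    h = x
    for i, w in enumerate(weights):
        w_eff = w if adapters is None else w + adapters[i][0] @ adapters[i][1]
        h = np.tanh(h @ w_eff)
    return h

def model_level_loss(x, fp_weights, q_weights, adapters):
    """Mean squared discrepancy between the final outputs of the two models."""
    ref = forward(x, fp_weights)
    out = forward(x, q_weights, adapters)
    return float(np.mean((ref - out) ** 2))

rng = np.random.default_rng(0)
d = 16
fp_weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(2)]
q_weights = [fake_quantize(w, bits=2) for w in fp_weights]
x = rng.normal(size=(32, d))

# No compensation: adapters that contribute nothing.
zero = [(np.zeros((d, 4)), np.zeros((4, d))) for _ in fp_weights]
loss_uncompensated = model_level_loss(x, fp_weights, q_weights, zero)

# Weight-level compensation at rank 4 and at full rank, for comparison.
rank4 = [svd_adapters(w, q, 4) for w, q in zip(fp_weights, q_weights)]
full = [svd_adapters(w, q, d) for w, q in zip(fp_weights, q_weights)]
loss_rank4 = model_level_loss(x, fp_weights, q_weights, rank4)
loss_full = model_level_loss(x, fp_weights, q_weights, full)

print(loss_uncompensated, loss_rank4, loss_full)
```

In this sketch the adapters are fitted per weight matrix and the loss is only measured at the model output. The abstract's point is that optimizing the adapters against the output-level loss directly lets the layers compensate for one another, so a low rank suffices.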

Paper Structure

This paper contains 31 sections, 6 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: LoRA-based quantization error compensation (LQEC): (a) direct error correction, (b) initialization for task-specific fine-tuning.
  • Figure 2: (a) Structure of the Transformer decoder model. (b-e) Four optimization approaches for fine-tuning LoRA for quantization error compensation.
  • Figure 3: (a) Average CSQA accuracy across optimization granularity and the rank of LoRA (LLaMA-2-7B). (b) Normalized weight discrepancy $(\|W-Q\|_{F})$ across models (LLaMA-2-7B and LLaMA-3-8B) and every linear module, normalized to 1 for 4-bit quantization discrepancy. (c) Minimum rank required for each quantization bit-precision to closely achieve the weight discrepancy of 4-bit quantization.
  • Figure 4: (a) Relative error of the LM-head output activation compared to the baseline inference across error compensation strategies, with base weights quantized using OmniQuant. (b) Relative error of intermediate activations and head output compared to baseline inference. (c) Comparison of average magnitudes of left singular vectors between linear-level and model-level optimization.
  • Figure 5: Additional observations on the comparison of the average magnitudes of each element in the left singular vector between linear-level and model-level optimization (Fig. 4(c)) across different layers.
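
The rank analysis described in the caption of Figure 3(b, c) can be illustrated with a small sketch. It assumes a toy per-tensor uniform quantizer and a random Gaussian weight matrix, neither of which comes from the paper, and the helper names `fake_quantize` and `min_rank_to_match` are hypothetical. By the Eckart-Young theorem, the best rank-r correction of the error W - Q leaves a Frobenius residual equal to the root of the sum of the squared discarded singular values, so the minimum rank needed to reach the 4-bit discrepancy can be read off the singular values directly.

```python
import numpy as np

def fake_quantize(w, bits):
    """Toy per-tensor uniform quantizer (an assumption, not one of the paper's quantizers)."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels
    return np.round((w - lo) / scale) * scale + lo

def min_rank_to_match(w, q, target):
    """Smallest r such that the best rank-r correction of W - Q has Frobenius residual <= target."""
    s = np.linalg.svd(w - q, compute_uv=False)
    # tail[r] = residual norm after removing the top-r singular directions.
    tail = np.sqrt(np.cumsum((s ** 2)[::-1])[::-1])
    tail = np.append(tail, 0.0)  # full rank removes the error entirely
    for r, residual in enumerate(tail):
        if residual <= target * (1 + 1e-9):  # small tolerance for float round-off
            return r
    return len(s)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))

# Reference discrepancy: ||W - Q||_F at 4 bits.
target = np.linalg.norm(w - fake_quantize(w, 4))

for bits in (4, 3, 2):
    q = fake_quantize(w, bits)
    normalized = np.linalg.norm(w - q) / target  # equals 1 at 4 bits by construction
    print(bits, round(float(normalized), 3), min_rank_to_match(w, q, target))
```

On real LLM weights and real quantizers the numbers will differ. The sketch only shows how a "minimum rank required" curve of this kind can be computed from the weight error.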