R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization

Jiayi Chen, Jieqi Shi, Jing Huo, Chen Wu

TL;DR

R2Q tackles the difficulty of 2-bit quantization for large language models by decomposing weight quantization into two sequential 1-bit steps, which yields an adaptive, distribution-robust lattice. The method derives an optimal 1-bit solution, adds a residual refinement stage, and trains with a straight-through estimator (STE), improving stability and speeding convergence. Experiments on Llama-7B, OPT-6.7B, and Qwen show that R2Q outperforms existing 2-bit approaches and can serve as a plug-and-play module within existing QAT frameworks, with substantially lower training cost than training-from-scratch methods. The work reports strong gains on both discriminative and generative tasks and suggests practical impact for edge-deployed LLMs through robust, ultra-low-bit quantization.

Abstract

The rapid progress of Large Language Models (LLMs) has brought substantial computational and memory demands, spurring the adoption of low-bit quantization. While 8-bit and 4-bit formats have become prevalent, extending quantization to 2 bits remains challenging due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q), a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations, forming an adaptive quantization lattice. Extensive evaluations on Llama, OPT, and Qwen across diverse benchmarks (covering question answering, commonsense reasoning, and language modeling) demonstrate that R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings. By refining quantization through a residual learning mechanism, R2Q enhances performance, improves training stability, and accelerates convergence under extreme compression. Furthermore, its modular design enables seamless integration with existing quantization-aware training (QAT) frameworks.
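
As a rough illustration of the two-step decomposition described above, the sketch below binarizes a weight vector, binarizes the residual, and sums the two scaled 1-bit kernels. It assumes the classical closed-form 1-bit fit (sign pattern with a scale equal to the mean absolute value) and a single quantization group. The abstract does not spell out R2Q's exact formulas, so the function names (`binarize`, `residual_2bit`, `mse`) and these choices are illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean


def binarize(values):
    """Least-squares 1-bit fit: v ~ alpha * b with b in {-1, +1}.

    Assumption: b = sign(v) and alpha = mean(|v|), the standard closed form.
    """
    alpha = fmean(abs(v) for v in values)
    signs = [1.0 if v >= 0 else -1.0 for v in values]
    return alpha, signs


def residual_2bit(weights):
    """Two sequential 1-bit steps: fit the weights, then fit the residual."""
    alpha1, b1 = binarize(weights)
    residual = [w - alpha1 * s for w, s in zip(weights, b1)]
    alpha2, b2 = binarize(residual)
    # Merge the two kernels; each entry is one of +-alpha1 +- alpha2.
    recon = [alpha1 * s1 + alpha2 * s2 for s1, s2 in zip(b1, b2)]
    return alpha1, alpha2, recon


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return fmean((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical toy weights, used only for illustration.
weights = [0.9, 0.7, 0.1, -0.2, -0.4, -1.3]
alpha1, alpha2, recon = residual_2bit(weights)
one_bit = [alpha1 * (1.0 if w >= 0 else -1.0) for w in weights]

print("alpha1, alpha2:", alpha1, alpha2)
print("1-bit MSE:", mse(weights, one_bit))
print("2-step MSE:", mse(weights, recon))
```

Under this formulation the second step can only reduce the error, since a zero second scale would reproduce the 1-bit result. The four reconstruction levels are set by the two scales, so their spacing follows the data rather than a fixed uniform grid.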

Paper Structure

This paper contains 26 sections, 26 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overview of the Residual Refinement Quantization (R2Q) mechanism. The full-precision weight $\textbf{W}$ and the first-step residual $\textbf{R}$ are binarized into two 1-bit kernels, $\textbf{Q}_1$ and $\textbf{Q}_2$, which are merged to reconstruct $\textbf{W}$.
  • Figure 2: Comparison of RTN (top) and R2Q (bottom) in 2-bit quantization. Gray lines connect points that correspond to the same real value. Points within the base of the conical region are mapped to the quantized position indicated by the apex. We refer to this structure as a quantization lattice. $s$ and $z$ are the scaling parameter and zero point for RTN, while $\alpha_1$ and $\alpha_2$ are the scaling parameters of the two R2Q kernels. R2Q achieves an adaptive mapping for imbalanced data distributions, whereas RTN yields a fixed, uniform allocation (in the top image, only two real values fall into the purple lattice), which uses the four quantization levels available in 2-bit quantization inefficiently.
  • Figure 3: The (a) gradient norms and (b) training loss of BitDistiller (RTN) and BitDistiller (R2Q). The R2Q-integrated version, BitDistiller (R2Q), significantly reduces gradient fluctuations and converges faster and more smoothly.
  • Figure 4: Weight deviations before and after QAT under RTN and R2Q. MSE is used as the evaluation metric. For clarity, we report results for the first two decoder layers (layer0 and layer1), the last two decoder layers (layer30 and layer31), and the LM head of OPT-6.7B and Llama-7B. R2Q consistently maintains close alignment with the full-precision weights in both coarse-grained and fine-grained settings, whereas RTN shows weak alignment and suffers substantial degradation under coarse-grained quantization.
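
The uniform allocation that the Figure 2 caption attributes to RTN can be reproduced on toy data. The sketch below assumes the common asymmetric round-to-nearest formulation with a scale `s` and zero point `z`. The caption names these parameters but not the exact formula, so the details and the function name `rtn_2bit` are assumptions. It counts how many values land on each of the four 2-bit codes when the data is imbalanced.

```python
import math
from collections import Counter


def rtn_2bit(values):
    """Asymmetric round-to-nearest onto 4 uniform levels (codes 0..3).

    Assumed formulation: s = (max - min) / 3, z = round(-min / s),
    code = clamp(round(v / s) + z, 0, 3), dequantized value = (code - z) * s.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values")
    s = (hi - lo) / 3  # four levels span three equal intervals
    # floor(x + 0.5) rounds half up, avoiding Python's banker's rounding.
    z = math.floor(-lo / s + 0.5)
    codes = [min(3, max(0, math.floor(v / s + 0.5) + z)) for v in values]
    dequant = [(q - z) * s for q in codes]
    return s, z, codes, dequant


# Hypothetical imbalanced data: a tight cluster near zero plus one outlier.
values = [-0.10, -0.05, 0.00, 0.02, 0.05, 0.08, 0.10, 2.90]
s, z, codes, dequant = rtn_2bit(values)
occupancy = Counter(codes)

# Seven values share code 0, the outlier takes code 3,
# and codes 1 and 2 stay empty.
print("scale:", s, "zero point:", z)
print("values per code:", dict(occupancy))
```

On this input the outlier stretches the grid so far that the whole cluster collapses onto one level, which is the inefficiency the caption describes for fixed, evenly spaced levels.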