R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization
Jiayi Chen, Jieqi Shi, Jing Huo, Chen Wu
TL;DR
R2Q addresses the difficulty of 2-bit quantization for large language models by decomposing weight quantization into two sequential 1-bit steps, which yields an adaptive, distribution-robust quantization lattice. The method derives an optimal 1-bit solution, adds a residual refinement stage, and trains with the straight-through estimator (STE), improving stability and speeding up convergence. Experiments on Llama-7B, OPT-6.7B, and Qwen show that R2Q outperforms existing 2-bit approaches and can serve as a plug-and-play module within existing quantization-aware training (QAT) frameworks, while requiring substantially fewer training resources than training-from-scratch methods. The gains hold on both discriminative and generative tasks, which suggests practical value for edge-deployed LLMs that need robust, ultra-low-bit quantization.
Abstract
The rapid progress of Large Language Models (LLMs) has brought substantial computational and memory demands, spurring the adoption of low-bit quantization. While 8-bit and 4-bit formats have become prevalent, extending quantization to 2 bits remains challenging due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q), a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations, forming an adaptive quantization lattice. Extensive evaluations on Llama, OPT, and Qwen across diverse benchmarks (covering question answering, commonsense reasoning, and language modeling) demonstrate that R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings. By refining quantization through a residual learning mechanism, R2Q enhances performance, improves training stability, and accelerates convergence under extreme compression. Furthermore, its modular design enables seamless integration with existing quantization-aware training (QAT) frameworks.
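The decomposition into two sequential 1-bit sub-quantizations can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a per-tensor scale, uses the standard least-squares 1-bit solution (sign pattern `sign(x)`, scale `mean(|x|)`) as a stand-in for the optimal 1-bit step, and omits STE-based training entirely. The function names `one_bit_quantize` and `residual_two_bit_quantize` are hypothetical.

```python
import numpy as np


def one_bit_quantize(x):
    """Least-squares 1-bit approximation x ~ alpha * sign(x).

    Assumption: a single per-tensor scale. For signs fixed to sign(x),
    alpha = mean(|x|) minimizes ||x - alpha * signs||^2.
    """
    signs = np.where(x >= 0, 1.0, -1.0)  # map zeros to +1 so every entry is +/-1
    alpha = np.abs(x).mean()
    return alpha * signs, alpha


def residual_two_bit_quantize(w):
    """Two sequential 1-bit steps: quantize w, then quantize the residual.

    The result takes at most four values, +/-alpha1 +/- alpha2, and both
    scales are fitted to the data rather than fixed in advance.
    """
    first, alpha1 = one_bit_quantize(w)
    residual = w - first                      # what the first bit failed to capture
    second, alpha2 = one_bit_quantize(residual)
    return first + second, alpha1, alpha2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=1024)                 # illustrative random weights
    w_hat, alpha1, alpha2 = residual_two_bit_quantize(w)
    one_bit, _ = one_bit_quantize(w)
    print("levels:", np.unique(w_hat))
    print("1-bit squared error:", np.sum((w - one_bit) ** 2))
    print("2-bit squared error:", np.sum((w - w_hat) ** 2))
```

In this sketch the second step can only reduce the squared error left by the first, since the least-squares scale on the residual removes `n * alpha2**2` from it. Because the four levels depend on `alpha1` and `alpha2`, they follow the weight distribution instead of sitting on a fixed uniform grid.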
