Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners
Yifei Gao, Jie Ou, Lei Wang, Jun Cheng, Mengchu Zhou
TL;DR
This work tackles the global optimization problem in quantizing large language model (LLM) weights, arguing that keeping weight distributions fixed and compensating errors one layer at a time are both insufficient. It introduces Singular-value Diagonal Expansion (SDE), which expands the diagonal structure of the weight's singular values, and Cross-layer Learning (CL), which distributes quantization errors across layers via a two-layer-ahead loss. The authors prove that SDE strictly generalizes Learnable Singular-value Increment (LSI) and report substantial improvements over OmniQuant, DuQuant, and PrefixQuant in both weight-only and weight-activation quantization on LLaMA, OPT, and Vicuna models, with robust gains on open benchmarks and minimal impact on inference speed. Both techniques are plug-and-play, which makes them practical for industrial deployment of quantized LLMs while keeping training cost and memory usage manageable.
Abstract
The quantization of large language models (LLMs) has been a prominent research area aimed at enabling their lightweight deployment in practice. Existing research on LLM quantization has mainly explored the interplay between weights and activations or employed auxiliary components, while neglecting the need to adjust the weights themselves during quantization. Consequently, the original weight distributions frequently fail to yield the desired results after round-to-nearest (RTN) quantization. Although techniques such as mixed precision and low-rank error approximation can improve the results of LLM quantization, they inevitably introduce additional computational overhead. On the other hand, traditional weight-quantization techniques, such as Generative Post-Training Quantization, rely on manually tweaking weight distributions to minimize local errors and therefore fall short of globally optimal outcomes. The recently proposed Learnable Singular-value Increment (LSI) improves global weight quantization by modifying weight distributions, but it disrupts the original distribution considerably. This introduces a pronounced bias toward the training data and can degrade downstream task performance. In this paper, we introduce Singular-value Diagonal Expansion (SDE), a more nuanced approach to refining weight distributions for better quantization alignment. Furthermore, we introduce Cross-layer Learning (CL), which improves overall quantization outcomes by distributing errors more evenly across layers. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches, including OmniQuant, DuQuant, and PrefixQuant.
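
To make the round-to-nearest (RTN) baseline mentioned in the abstract concrete, here is a minimal NumPy sketch. It is an illustration only and is not taken from the paper: the function name `rtn_quantize`, the per-row (per-output-channel) grouping, the asymmetric min/max scaling, and the 4-bit setting are all assumptions chosen for simplicity. It shows how RTN maps each weight onto a fixed grid without adjusting the weights, which is the limitation the abstract points to.

```python
import numpy as np

def rtn_quantize(w, n_bits=4):
    """Hypothetical helper: asymmetric per-row RTN, returning dequantized weights."""
    qmax = 2 ** n_bits - 1                      # largest integer level, e.g. 15 for 4 bits
    w_min = w.min(axis=1, keepdims=True)        # per-row minimum
    w_max = w.max(axis=1, keepdims=True)        # per-row maximum
    scale = np.maximum(w_max - w_min, 1e-8) / qmax   # grid step per row
    zero_point = np.round(-w_min / scale)       # integer offset so w_min maps near level 0
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)  # round to nearest level
    return (q - zero_point) * scale             # map integer levels back to floats

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                    # toy weight matrix (assumed shape)
W_hat = rtn_quantize(W, n_bits=4)
err = np.abs(W - W_hat).max()                   # worst-case elementwise rounding error
print(f"max abs error: {err:.4f}")
```

Because the grid is fixed by each row's minimum and maximum, a few outlier weights widen the step size for the whole row. Methods that adjust the weight distribution before rounding, as the abstract describes, aim to reduce exactly this kind of error.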
