Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners

Yifei Gao; Jie Ou; Lei Wang; Jun Cheng; Mengchu Zhou

TL;DR

This work tackles the global optimization problem in quantizing large language model weights by revealing that fixed weight distributions and per-layer compensation are insufficient. It introduces Singular-value Diagonal Expansion (SDE), which expands the diagonal structure of the weight's singular values, and Cross-layer Learning (CL), which distributes quantization errors across layers via a two-layer-ahead loss. The authors prove that SDE strictly generalizes Learnable Singular-value Increment (LSI) and demonstrate substantial improvements over OmniQuant, DuQuant, and PrefixQuant across weight-only and weight-activation quantization on LLaMA, OPT, and Vicuna models, with robust gains on open benchmarks and minimal inference-speed impact. These plug‑and‑play techniques offer practical improvements for industrial deployment of quantized LLMs while maintaining manageable training costs and memory usage.

Abstract

The quantization of large language models (LLMs) has been a prominent research area aimed at enabling their lightweight deployment in practice. Existing research on LLM quantization has mainly explored the interplay between weights and activations, or employed auxiliary components, while neglecting the necessity of adjusting weights during quantization. Consequently, original weight distributions frequently fail to yield the desired results after round-to-nearest (RTN) quantization. Although incorporating techniques such as mixed precision and low-rank error approximation into LLM quantization can yield improved results, they inevitably introduce additional computational overhead. On the other hand, traditional techniques for weight quantization, such as Generative Post-Training Quantization, rely on manually tweaking weight distributions to minimize local errors, but they fall short of achieving globally optimal outcomes. Although the recently proposed Learnable Singular-value Increment improves global weight quantization by modifying weight distributions, it disrupts the original distribution considerably. This introduces a pronounced bias toward the training data and can degrade downstream task performance. In this paper, we introduce Singular-value Diagonal Expansion, a more nuanced approach to refining weight distributions to achieve better quantization alignment. Furthermore, we introduce Cross-layer Learning, which improves overall quantization outcomes by distributing errors more evenly across layers. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches, including OmniQuant, DuQuant, and PrefixQuant.
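
As a point of reference for the round-to-nearest (RTN) baseline mentioned in the abstract, the following is a minimal NumPy sketch of min/max uniform quantization applied to a whole tensor. The function name `rtn_quantize`, the per-tensor granularity, and the rounded zero point are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rtn_quantize(w, n_bits=4):
    """Illustrative per-tensor asymmetric RTN (an assumption, not the paper's code).

    Maps w onto 2**n_bits uniformly spaced levels between min(w) and max(w)
    and returns the dequantized (floating-point) result.
    """
    w = np.asarray(w, dtype=np.float64)
    q_max = 2 ** n_bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / q_max
    if scale == 0:
        # Constant tensor: nothing to quantize.
        return w.copy()
    zero_point = np.round(-w_min / scale)
    # Round to the nearest integer level, then clamp to the valid range.
    q = np.clip(np.round(w / scale) + zero_point, 0, q_max)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
w_hat = rtn_quantize(w, n_bits=3)
mse = float(np.mean((w - w_hat) ** 2))  # quantization error of the unmodified weights
```

With 3 bits, `w_hat` takes at most 8 distinct values, and `mse` is the error that weight-adjustment methods such as those described here try to reduce by reshaping the weights before rounding.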

Paper Structure (35 sections, 1 theorem, 14 equations, 10 figures, 8 tables)

This paper contains 35 sections, 1 theorem, 14 equations, 10 figures, 8 tables.

Introduction
Related Work
KV Cache Reduction
Quantization Methods
General Quantization Methods
Quantization Outlier Suppression
Preliminaries
Uniform Quantization
Learnable Singular-value Increment
Proposed Method
Weight Adjustment Properties
Singular-value Diagonal Expansion
Weight Adjustment Flexibility
Formal Proof of Greater Flexibility
Cross-layer Learning
...and 20 more sections

Key Result

Theorem 1

Let $\tilde{\mathbf{W}}$ be the set of all weight matrices that can be produced by the LSI parameterization in Eq. \ref{eq:lsi_decompose}, and let $\hat{\mathbf{W}}$ be the corresponding set produced by the SDE parameterization in Eq. \ref{eq:sde_ele_eq} with diagonal-expansion width $n \ge 0$. Then, for every $n \ge$ [...] When $n=0$ (equivalently, when all off-diagonal elements added by $\mathsf{Map}(\mathbf{I}^{D})$ ar [...]
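
The relationship stated in the theorem can be illustrated numerically. The sketch below assumes that LSI adds a learnable vector to the singular values and that SDE adds a learnable band of width $n$ around the diagonal of $diag(\mathbf{S})$. The exact form of $\mathsf{Map}(\cdot)$ is not given in this summary, so `band_mask`, `lsi_weight`, and `sde_weight` are hypothetical stand-ins rather than the authors' definitions.

```python
import numpy as np

def band_mask(r, n):
    """Boolean r x r mask that is True where |i - j| <= n (assumed Map pattern)."""
    idx = np.arange(r)
    return np.abs(idx[:, None] - idx[None, :]) <= n

def lsi_weight(U, S, Vt, inc):
    """Assumed LSI form: learnable increment on the singular values only."""
    return U @ np.diag(S + inc) @ Vt

def sde_weight(U, S, Vt, inc_band, n):
    """Assumed SDE form: learnable entries inside a band of width n."""
    core = np.diag(S) + inc_band * band_mask(len(S), n)
    return U @ core @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Hypothetical learned parameters: the diagonal is shared with LSI,
# the off-diagonal entries exist only in the SDE parameterization.
inc = 0.01 * rng.normal(size=S.shape)
off_diag = 0.01 * rng.normal(size=(len(S), len(S)))
np.fill_diagonal(off_diag, 0.0)
inc_band = np.diag(inc) + off_diag

W_lsi = lsi_weight(U, S, Vt, inc)
W_sde_n0 = sde_weight(U, S, Vt, inc_band, n=0)  # band collapses to the diagonal
W_sde_n2 = sde_weight(U, S, Vt, inc_band, n=2)  # uses the extra off-diagonal terms
```

Under these assumptions, `W_sde_n0` coincides with `W_lsi`, while `W_sde_n2` reaches a matrix that the diagonal-only parameterization with the same $\mathbf{U}$ and $\mathbf{V}$ cannot produce.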

Figures (10)

Figure 1: (a) Compared to existing PTQ weight quantization methods, our techniques achieve superior loss reduction while preserving fast training times. (b) The W2A16g128 quantization results. Our methods consistently deliver superior quantization results even for extremely low-bit settings. (c) The W3A4 quantization on LLaMA-7B. Our techniques significantly outperform the original DuQuant approach, yielding substantially improved performance.
Figure 2: Overview of SDE. After decomposing the linear weight matrix $\mathbf{W}$ into $\mathbf{U}$, $\mathbf{S}$, and $\mathbf{V}$ through singular value decomposition, our technique introduces a learnable matrix $\mathbf{I}^{D}$. This matrix is then appropriately mapped into the diagonal positions of $\mathrm{diag}(\mathbf{S})$ using the mapping function $\mathsf{Map}(\cdot)$ defined in Eq. \ref{eq:sde_eq}.
Figure 3: Error introduced by quantization. The weight is from the 2nd-layer o-proj in LLaMA-3-8B. The weight matrix is divided into $32 \times 32$ blocks with $128 \times 128$ individual weights in each, and quantization is performed block by block. (a) shows the mean of the normalized weights, while (b) shows the normalized quantization error introduced by each block.
Figure 4: Weight redistribution under 3-bit quantization. The dashed lines represent the corresponding quantized integers after scaling. Both LSI and SDE modify the weight distribution to align it with the quantization setting, but LSI tends to induce much more disturbance: its MSE relative to the original weights is a third higher than that of SDE.
Figure 5: GPT-4 evaluation on the MT-Bench.
...and 5 more figures
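
The block-wise measurement described in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes min/max round-to-nearest inside each block and reports the raw per-block mean squared error. The caption's normalization scheme is not specified here, so it is left out, and a small random matrix stands in for the $4096 \times 4096$ o-proj weight.

```python
import numpy as np

def rtn(block, n_bits):
    """Min/max round-to-nearest on one block (illustrative assumption)."""
    q_max = 2 ** n_bits - 1
    lo, hi = block.min(), block.max()
    scale = (hi - lo) / q_max
    if scale == 0:
        return block.copy()
    return np.clip(np.round((block - lo) / scale), 0, q_max) * scale + lo

def blockwise_error(W, block, n_bits):
    """Return per-block weight mean and per-block quantization MSE."""
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    # View W as a grid of (block x block) tiles: shape (R, C, block, block).
    tiles = W.reshape(rows // block, block, cols // block, block).swapaxes(1, 2)
    mean = np.empty(tiles.shape[:2])
    err = np.empty(tiles.shape[:2])
    for i in range(tiles.shape[0]):
        for j in range(tiles.shape[1]):
            t = tiles[i, j]
            mean[i, j] = t.mean()
            err[i, j] = np.mean((t - rtn(t, n_bits)) ** 2)
    return mean, err

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))  # stand-in for a real weight matrix
block_mean, block_err = blockwise_error(W, block=8, n_bits=3)
```

Plotting `block_mean` and `block_err` as heat maps would give a view analogous to panels (a) and (b); for the setting in the caption, `block=128` on a $4096 \times 4096$ matrix yields the $32 \times 32$ grid.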

Theorems & Definitions (1)

Theorem 1: SDE Strictly Generalizes LSI
