Dissecting Quantization Error: A Concentration-Alignment Perspective

Marco Federici; Boris van Breugel; Paul Whatmough; Markus Nagel

TL;DR

This work analyzes linear-layer quantization via the signal-to-quantization-noise ratio (SQNR), showing that for uniform integer quantization at a fixed bit width, SQNR decomposes into the concentration of weights and activations and the alignment of their dominant variation directions.

Abstract

Quantization can drastically increase the efficiency of large language and vision models, but typically incurs an accuracy drop. Recently, function-preserving transforms (e.g. rotations, Hadamard transforms, channel-wise scaling) have been successfully applied to reduce post-training quantization error, yet a principled explanation remains elusive. We analyze linear-layer quantization via the signal-to-quantization-noise ratio (SQNR), showing that for uniform integer quantization at a fixed bit width, SQNR decomposes into (i) the concentration of weights and activations (capturing spread and outliers), and (ii) the alignment of their dominant variation directions. This reveals an actionable insight: beyond concentration, which is the focus of most prior transforms (e.g. rotations or Hadamard), improving alignment between weights and activations can further reduce quantization error. Motivated by this, we introduce block Concentration-Alignment Transforms (CAT), a lightweight linear transformation that uses a covariance estimate from a small calibration set to jointly improve concentration and alignment, approximately maximizing SQNR. Experiments across several LLMs show that CAT consistently matches or outperforms prior transform-based quantization methods at 4-bit precision, confirming the insights gained in our framework.
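To make "function-preserving transform" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It applies an orthonormal Hadamard matrix to the activations and the matching inverse to the weights, so the layer output is unchanged, and then compares the peak-to-RMS ratio of the activations before and after. The peak-to-RMS ratio is only a stand-in for concentration assumed here for illustration; the synthetic data, the single outlier channel, and all function names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def crest_factor(a):
    """Mean per-row peak-to-RMS ratio; a rough proxy for how outlier-heavy the rows are."""
    peak = np.max(np.abs(a), axis=1)
    rms = np.sqrt(np.mean(a ** 2, axis=1))
    return float(np.mean(peak / rms))

rng = np.random.default_rng(0)
d = 64
X = rng.standard_normal((512, d))   # synthetic activations (tokens x features)
X[:, 3] *= 50.0                     # hypothetical outlier channel
W = rng.standard_normal((32, d))    # synthetic weights (out_features x in_features)

H = hadamard(d)
X_t = X @ H.T                       # transformed activations: H x
W_t = W @ H.T                       # transformed weights: W H^T (H^{-1} = H^T)

Y = X @ W.T                         # original layer output
Y_t = X_t @ W_t.T                   # output after the transform

crest_before = crest_factor(X)
crest_after = crest_factor(X_t)

print("output preserved:", np.allclose(Y, Y_t))
print("crest factor before:", crest_before)
print("crest factor after: ", crest_after)
```

Because the Hadamard matrix is orthonormal, the transform spreads the outlier channel's energy evenly over all channels without changing the layer output, which is why the peak-to-RMS ratio drops in this toy setting.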

Paper Structure (26 sections, 4 theorems, 26 equations, 6 figures, 1 table)

Introduction
Concentration-Alignment Framework
Interactions of Bit width, Concentration and Alignment
Bit width
Concentration
Alignment
Analysis of Linear Transformations
Concentration-Alignment Transforms
Optimizing alignment
Approximately optimal transform
CAT Block Approximation
Related Work
Experiments
Models.
Calibration.
...and 11 more sections

Key Result

Lemma 2.1

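The SQNR of a quantized linear layer can be approximated by the harmonic (parallel) sum of the SQNRs measured when quantizing activations and weights separately; that is, the reciprocal of the joint SQNR is approximately the sum of the reciprocals of the two separate SQNRs.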
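A small numerical check can illustrate the lemma. The sketch below is an illustration under stated assumptions, not the paper's experimental setup: it uses synthetic Gaussian weights and activations, symmetric max-abs scaling per row, and 4-bit integers for both operands. It measures the SQNR with only activations quantized, with only weights quantized, and with both quantized, then compares the last value with the harmonic sum of the first two. All names are hypothetical.

```python
import numpy as np

def quantize_sym(t, bits):
    """Symmetric uniform integer fake-quantization with one max-abs scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(t), axis=1, keepdims=True) / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

def sqnr(ref, approx):
    """Linear-scale SQNR: signal power divided by quantization-noise power."""
    return float(np.sum(ref ** 2) / np.sum((ref - approx) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((2048, 256))   # synthetic activations (tokens x in_features)
W = rng.standard_normal((128, 256))    # synthetic weights (out_features x in_features)

Y = X @ W.T                            # full-precision layer output
Xq = quantize_sym(X, bits=4)           # per-token activation quantization
Wq = quantize_sym(W, bits=4)           # per-output-channel weight quantization

sqnr_x = sqnr(Y, Xq @ W.T)             # activations quantized only
sqnr_w = sqnr(Y, X @ Wq.T)             # weights quantized only
sqnr_joint = sqnr(Y, Xq @ Wq.T)        # both quantized

# Harmonic (parallel) sum of the two separate SQNRs.
sqnr_harmonic = 1.0 / (1.0 / sqnr_x + 1.0 / sqnr_w)

to_db = lambda v: 10.0 * np.log10(v)
print("activation-only SQNR (dB):", to_db(sqnr_x))
print("weight-only SQNR (dB):    ", to_db(sqnr_w))
print("joint SQNR (dB):          ", to_db(sqnr_joint))
print("harmonic sum (dB):        ", to_db(sqnr_harmonic))
```

The approximation works when the two noise sources are roughly uncorrelated and the product of the two quantization errors is negligible, so their noise powers simply add in the output.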

Figures (6)

Figure 1: The signal-to-quantization-noise ratio (SQNR) of a quantized linear layer can be factorized into a bit width term, concentration, and alignment. Orthogonal transforms (e.g. Hadamard, rotation) can improve the concentration (reduce outliers), but not the alignment. The Concentration-Alignment Transform (CAT) is designed to improve alignment too, which yields a W4A4 SQNR that often rivals W6A6 quantization.
Figure 2: Empirical verification of the approximation reported in Theorem 2.4 for linear layers of Llama-v3.2-1B (left) and Qwen-v3-8B with Hadamard rotations applied before every linear layer (right) at W4A4, W4A8 and W8A8 quantization. Each dot represents one linear layer in the architecture. The approximation is close to the true SQNR for almost all layers. In Llama 3.2 1B-it, the exception is layer.1.mlp.down_proj, which is easier to quantize than the approximation suggests, because the massive outlier of the [BOS] token (Sun et al., 2024) dominates the SQNR computation. For large SQNR, floating-point issues limit the true SQNR (Qwen v3, top right).
Figure 3: Comparison of activation quantization SQNR (y-axis), weight quantization SQNR (x-axis) and joint SQNR (iso-lines) for the linear layers of a Llama v3 8B architecture quantized at several bit widths. Starting from a 4-bit quantization of the linear layer (bottom left), increasing the weight bit width by 4 bits results in a horizontal shift of around 24 dB, while increasing the number of bits used for the activations results in a corresponding vertical shift. Since activation SQNR is worse than weight SQNR ($r({\bm{x}},{\bm{W}})<1$), this latter scenario results in much higher overall SQNR.
Figure 4: Distribution of concentration of weight (left) and activation (right) quantization, for different layers and under different transforms. Without transforms, activations are more heavy-tailed (worse than Laplace, red region) than weights. Channel scaling (such as SmoothQuant) moves the activation outliers into the weights; this improves activation concentration, but worsens weight concentration significantly. Hadamard and CAT mix channels, which effectively makes them close to Gaussian for all layers.
Figure 5: Distribution of alignment across layers under different transforms. The green shaded region indicates the achievable region, and highlights vast room for alignment improvements ($>10$ dB) on many layers. Note that rotation-based transforms cannot improve alignment, hence Hadamard transforms perform the same as no transform. Channel scaling is similar to an alignment-optimal transform with block size 1. However, scaling the channels by the ratio of the maxima is sub-optimal and does not necessarily improve alignment on all layers. Block-diagonal matrices informed by CAT are good approximations of $\hat{{\bm{M}}}$, and consistently improve alignment across all layers.
...and 1 more figures
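The shift of around 24 dB quoted in the Figure 3 caption matches the usual rule of thumb for uniform quantizers with a fixed range: each extra bit halves the step size, which cuts the noise power by a factor of four, or about 6.02 dB. The short sketch below works through that arithmetic and then combines it with the harmonic-sum approximation of Lemma 2.1 to show why adding bits to the weaker operand helps more. The starting values of 18 dB and 25 dB are hypothetical numbers chosen for illustration, not measurements from the paper.

```python
import math

def sqnr_gain_db(extra_bits):
    """Idealized SQNR gain from adding bits to a uniform quantizer with a fixed range."""
    # Each extra bit halves the step size, so noise power drops by 4x per bit.
    return 20.0 * math.log10(2.0 ** extra_bits)

def joint_sqnr_db(act_db, weight_db):
    """Harmonic (parallel) sum of two SQNRs given in dB, returned in dB."""
    inv = 10.0 ** (-act_db / 10.0) + 10.0 ** (-weight_db / 10.0)
    return -10.0 * math.log10(inv)

gain = sqnr_gain_db(4)  # about 24.08 dB for 4 extra bits

# Hypothetical starting point in which activations are the bottleneck.
act_db, weight_db = 18.0, 25.0
base = joint_sqnr_db(act_db, weight_db)
more_weight_bits = joint_sqnr_db(act_db, weight_db + gain)
more_act_bits = joint_sqnr_db(act_db + gain, weight_db)

print(f"gain for 4 extra bits:      {gain:.2f} dB")
print(f"joint SQNR, baseline:       {base:.2f} dB")
print(f"joint SQNR, +4 weight bits: {more_weight_bits:.2f} dB")
print(f"joint SQNR, +4 act bits:    {more_act_bits:.2f} dB")
```

In the parallel sum the smaller SQNR dominates, so the joint value is capped just below the weaker operand; raising the stronger operand barely moves it.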

Theorems & Definitions (4)

Lemma 2.1
Lemma 2.2
Lemma 2.3
Theorem 2.4
