DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs

Haokun Lin; Haobo Xu; Yichen Wu; Jingzhi Cui; Yingtao Zhang; Linzhan Mou; Linqi Song; Zhenan Sun; Ying Wei

TL;DR

DuQuant targets a core bottleneck in post-training quantization of large language models: activation outliers, especially the Massive Outliers that concentrate at the FFN down projection. A block-diagonal rotation spreads outliers across the channels within each block, a zigzag channel permutation balances them across blocks, and a diagonal smoothing step shifts part of the remaining quantization difficulty away from activations. The paper pairs theoretical analysis of the rotation and the zigzag permutation with experiments in which DuQuant outperforms prior baselines under 4-bit weight-activation quantization across several LLM families (LLaMA, Vicuna, LLaMA3) and tasks, with reported speedups and memory savings. The approach removes the reliance on GPTQ in many settings and stays robust under calibration-free and low-data conditions, which makes quantized LLMs more practical to deploy on resource-constrained hardware.
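The rotation idea summarized above can be illustrated with a small sketch. It is not the authors' implementation: the normalized Hadamard block used here is a stand-in assumption for the rotation that DuQuant builds from outlier dimensions, and the helper names (`matmul`, `hadamard`, `block_diag`, `max_abs`) and the toy matrices `X` and `W` are hypothetical. The sketch only shows two properties that any block-diagonal orthogonal rotation `R` has: the layer output is unchanged when `X` is replaced by `X R` and `W` by `R^T W`, and a single massive channel is spread over the channels of its block.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two); orthogonal."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = n ** -0.5
    return [[v * scale for v in row] for row in h]


def block_diag(block, count):
    """Place `count` copies of `block` on the diagonal of a larger zero matrix."""
    size = len(block)
    out = [[0.0] * (size * count) for _ in range(size * count)]
    for k in range(count):
        for i in range(size):
            for j in range(size):
                out[k * size + i][k * size + j] = block[i][j]
    return out


def max_abs(a):
    return max(abs(v) for row in a for v in row)


# Toy activations: 2 tokens, 8 channels, one massive outlier in channel 1.
X = [[0.5, 40.0, -0.3, 0.2, 0.1, -0.4, 0.3, 0.2],
     [0.4, 38.0, 0.1, -0.2, 0.3, 0.2, -0.1, 0.4]]
# Toy weights: 8 input channels, 3 output channels (arbitrary values).
W = [[((i * 3 + j * 5) % 7 - 3) / 10.0 for j in range(3)] for i in range(8)]

# Assumption: a Hadamard block stands in for DuQuant's constructed rotation.
R = block_diag(hadamard(4), 2)

X_rot = matmul(X, R)             # rotated activations
W_rot = matmul(transpose(R), W)  # weights absorb the inverse rotation
Y_ref = matmul(X, W)
Y_rot = matmul(X_rot, W_rot)

print("max |X|     =", max_abs(X))
print("max |X R|   =", max_abs(X_rot))
print("max |Y diff| =", max(abs(a - b) for ra, rb in zip(Y_ref, Y_rot)
                            for a, b in zip(ra, rb)))
```

Because the rotation is block-diagonal, the outlier is only shared among the four channels of its own block; the second block is untouched, which is the gap the permutation step is meant to close.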

Abstract

Quantization of large language models (LLMs) faces significant challenges, particularly due to outlier activations that impede efficient low-bit representation. Traditional approaches predominantly address Normal Outliers, which are activations with relatively large magnitudes across all tokens. However, these methods struggle to smooth Massive Outliers, which display far larger values and cause severe performance degradation in low-bit quantization. In this paper, we introduce DuQuant, a novel approach that uses rotation and permutation transformations to mitigate both massive and normal outliers more effectively. First, DuQuant constructs a rotation matrix, using specific outlier dimensions as prior knowledge, and applies it block-wise to redistribute outliers to adjacent channels. Second, we employ a zigzag permutation to balance the distribution of outliers across blocks, thereby reducing block-wise variance. A subsequent rotation further smooths the activation landscape, enhancing model performance. DuQuant simplifies the quantization process and excels at managing outliers, outperforming state-of-the-art baselines across various sizes and types of LLMs on multiple tasks, even with 4-bit weight-activation quantization. Our code is available at https://github.com/Hsu1023/DuQuant.
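The zigzag permutation mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: it assumes channels are ranked by their largest activation magnitude and dealt to blocks in a back-and-forth (serpentine) order, and the names `zigzag_permutation`, `block_means`, and the toy `channel_max` values are hypothetical.

```python
def zigzag_permutation(channel_max, num_blocks):
    """Return a channel order that deals channels to blocks in serpentine order.

    Channels are ranked from largest to smallest magnitude; ranks are assigned
    to blocks 0..K-1, then K-1..0, and so on, so each block receives a similar
    mix of large and small channels.
    """
    if len(channel_max) % num_blocks != 0:
        raise ValueError("channel count must be divisible by num_blocks")
    order = sorted(range(len(channel_max)), key=lambda c: -channel_max[c])
    blocks = [[] for _ in range(num_blocks)]
    for rank, ch in enumerate(order):
        round_idx, pos = divmod(rank, num_blocks)
        # Even rounds go left to right, odd rounds go right to left.
        block = pos if round_idx % 2 == 0 else num_blocks - 1 - pos
        blocks[block].append(ch)
    # Concatenate the blocks to obtain the new channel order.
    return [ch for block in blocks for ch in block]


def block_means(values, perm, num_blocks):
    """Mean of `values` inside each contiguous block of the permuted order."""
    size = len(perm) // num_blocks
    return [sum(values[c] for c in perm[k * size:(k + 1) * size]) / size
            for k in range(num_blocks)]


# Toy per-channel maxima: all large channels start out in the first block.
channel_max = [9.0, 8.0, 7.0, 6.0, 1.0, 1.0, 1.0, 1.0]
identity = list(range(len(channel_max)))

perm = zigzag_permutation(channel_max, num_blocks=2)
before = block_means(channel_max, identity, 2)
after = block_means(channel_max, perm, 2)

print("permutation:", perm)            # [0, 3, 4, 7, 1, 2, 5, 6]
print("block means before:", before)   # [7.5, 1.0]
print("block means after: ", after)    # [4.25, 4.25]
```

In this toy case the per-block means become equal after the permutation, so a second block-wise rotation operates on blocks of comparable magnitude instead of one heavy block and one light block.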

Paper Structure (59 sections, 4 theorems, 12 equations, 14 figures, 36 tables, 1 algorithm)

Introduction
Motivation
Normal Outliers and Massive Outliers.
Massive Outliers Exist at the Second Linear Layer of FFN Module.
Massive Outliers Enlarge Quantization Difficulty.
Method
Preliminaries
The proposed DuQuant Method
The Rotation Transformation.
The Permutation Transformation.
The Overall DuQuant Method.
Theoretical Analysis
Experiment
Models and Evaluations.
Implementation Details.
...and 44 more sections

Key Result

Theorem 1

For the activation input $\mathbf{X}\in \mathbb{R}^{T\times C_{in}}$, $\hat{\mathbf{R}}\in \mathbb{R}^{2^n\times 2^n}$ is a diagonal block matrix constructed as per Eqn. (eq:diagonal-rotation). For a specific block $b_i$, let $O_j(\cdot)$ represent the maximum outlier of the $j$-th dimension $d_j$ w

Figures (14)

Figure 1: Visualizations of Outliers in LLaMA2-7B. (a) Input activation of Layer1 attention key projection shows Normal Outliers with relatively high magnitudes across all token sequences. (b) Input activation of Layer1 FFN down projection reveals Massive Outliers, presenting extremely high magnitudes (around 1400) at very few tokens. (c) Application of SmoothQuant on FFN down projection, illustrating its struggle with massive outliers in the Activation matrix. (d) Corresponding weight changes with SmoothQuant, highlighting the emergence of new outliers.
Figure 2: Transformation Steps for Activation Matrices after smooth technique. (a) Sequential transformations on Normal Outliers: ① initial rotation to reduce outliers within blocks, ② permutation to evenly distribute outliers across blocks, and ③ a second rotation for further smoothing. (b) Activation changes for Massive Outliers before and after DuQuant. (c) A sample matrix for highlighting the continual reduction of outliers through rotation and permutation, with outliers marked in dark blue.
Figure 3: GPT-4 evaluation on the MT-Bench.
Figure 4: LLaMA2-7B Attention key_proj.
Figure 5: Computational overhead analysis.
...and 9 more figures

Theorems & Definitions (6)

Theorem 1: Rotation
Theorem 2: Zigzag Permutation
Theorem 1: Rotation
proof
Theorem 2: Zigzag Permutation
proof
