DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs

Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, Ying Wei

TL;DR

DuQuant addresses a core bottleneck in post-training quantization of large language models: activation outliers, especially Massive Outliers, which concentrate in the inputs to FFN down-projections. After a smoothing diagonal shifts part of the quantization difficulty from activations to weights, DuQuant combines block-diagonal rotations with a zigzag channel permutation to spread the remaining outliers across channels within each block and then across blocks. Theoretical analysis of the rotation and the permutation accompanies empirical results in which 4-bit weight-activation quantization with DuQuant outperforms prior baselines across several LLM families (LLaMA, Vicuna, LLaMA3) and tasks, along with inference speedups and memory savings. The approach removes the reliance on GPTQ in many settings and remains robust under calibration-free and low-data conditions, which makes quantized LLMs more practical to deploy on resource-constrained hardware.

Abstract

Quantization of large language models (LLMs) faces significant challenges, particularly due to the presence of outlier activations that impede efficient low-bit representation. Traditional approaches predominantly address Normal Outliers, which are activations across all tokens with relatively large magnitudes. However, these methods struggle with smoothing Massive Outliers that display significantly larger values, which leads to significant performance degradation in low-bit quantization. In this paper, we introduce DuQuant, a novel approach that utilizes rotation and permutation transformations to more effectively mitigate both massive and normal outliers. First, DuQuant starts by constructing the rotation matrix, using specific outlier dimensions as prior knowledge, to redistribute outliers to adjacent channels by block-wise rotation. Second, we further employ a zigzag permutation to balance the distribution of outliers across blocks, thereby reducing block-wise variance. A subsequent rotation further smooths the activation landscape, enhancing model performance. DuQuant simplifies the quantization process and excels in managing outliers, outperforming the state-of-the-art baselines across various sizes and types of LLMs on multiple tasks, even with 4-bit weight-activation quantization. Our code is available at https://github.com/Hsu1023/DuQuant.
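
The rotate, permute, rotate pipeline described in the abstract can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the per-block matrices here are random orthogonal matrices (QR of a Gaussian) rather than the outlier-guided rotations the paper constructs, the zigzag ordering is one plausible reading of "zigzag permutation" (channels sorted by peak magnitude and dealt to blocks back and forth), and the helper names `block_rotation`, `zigzag_permutation`, and `dual_transform` are hypothetical. What it does show reliably is that, because every factor is orthogonal, applying the inverse transform to the weights leaves the layer output unchanged.

```python
import numpy as np

def block_rotation(num_channels, block_size, rng):
    """Block-diagonal orthogonal matrix. ASSUMPTION: each block is a random
    orthogonal matrix, standing in for the paper's outlier-guided rotation."""
    R = np.zeros((num_channels, num_channels))
    for start in range(0, num_channels, block_size):
        q, _ = np.linalg.qr(rng.standard_normal((block_size, block_size)))
        R[start:start + block_size, start:start + block_size] = q
    return R

def zigzag_permutation(X, block_size):
    """Channel order that deals channels, sorted by peak |activation|,
    to blocks in serpentine order (0..K-1, K-1..0, ...)."""
    num_blocks = X.shape[1] // block_size
    order = np.argsort(-np.abs(X).max(axis=0))  # largest peak first
    blocks = [[] for _ in range(num_blocks)]
    for rank, ch in enumerate(order):
        rnd, pos = divmod(rank, num_blocks)
        b = pos if rnd % 2 == 0 else num_blocks - 1 - pos
        blocks[b].append(ch)
    return np.concatenate(blocks)

def dual_transform(X, W, block_size, seed=0):
    """Apply rotation -> zigzag permutation -> rotation to X and the
    inverse to W, so that X_t @ W_t equals X @ W up to rounding."""
    rng = np.random.default_rng(seed)
    C = X.shape[1]
    R1 = block_rotation(C, block_size, rng)
    perm = zigzag_permutation(X @ R1, block_size)
    P = np.eye(C)[:, perm]            # (X @ R1) @ P reorders the columns by perm
    R2 = block_rotation(C, block_size, rng)
    G = R1 @ P @ R2                   # orthogonal, so its inverse is G.T
    return X @ G, G.T @ W

# Toy data: 16 tokens, 8 channels, one hypothetical outlier channel.
rng = np.random.default_rng(1)
X = rng.standard_normal((16, 8))
X[:, 3] += 50.0
W = rng.standard_normal((8, 4))

X_t, W_t = dual_transform(X, W, block_size=4)
print("output preserved:", np.allclose(X @ W, X_t @ W_t))
print("peak |activation| before/after:", np.abs(X).max(), np.abs(X_t).max())
```

The permutation is computed on the once-rotated activations because the first rotation only mixes channels inside a block; reordering channels between the two rotations is what lets a large channel in one block be averaged with small channels from another.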

Paper Structure

This paper contains 59 sections, 4 theorems, 12 equations, 14 figures, 36 tables, 1 algorithm.

Key Result

Theorem 1

For the activation input $\mathbf{X}\in \mathbb{R}^{T\times C_{in}}$, $\hat{\mathbf{R}}\in \mathbb{R}^{2^n\times 2^n}$ is a diagonal block matrix constructed as per Eqn. (eq:diagonal-rotation). For a specific block $b_i$, let $O_j(\cdot)$ represent the maximum outlier of the $j$-th dimension $d_j$ w

Figures (14)

  • Figure 1: Visualizations of Outliers in LLaMA2-7B. (a) Input activation of Layer1 attention key projection shows Normal Outliers with relatively high magnitudes across all token sequences. (b) Input activation of Layer1 FFN down projection reveals Massive Outliers, presenting extremely high magnitudes (around 1400) at very few tokens. (c) Application of SmoothQuant on FFN down projection, illustrating its struggle with massive outliers in the Activation matrix. (d) Corresponding weight changes with SmoothQuant, highlighting the emergence of new outliers.
  • Figure 2: Transformation Steps for Activation Matrices after the smoothing technique. (a) Sequential transformations on Normal Outliers: ① initial rotation to reduce outliers within blocks, ② permutation to evenly distribute outliers across blocks, and ③ a second rotation for further smoothing. (b) Activation changes for Massive Outliers before and after DuQuant. (c) A sample matrix for highlighting the continual reduction of outliers through rotation and permutation, with outliers marked in dark blue.
  • Figure 3: GPT-4 evaluation on the MT-Bench.
  • Figure 4: LLaMA2-7B Attention key_proj.
  • Figure 5: Computational overhead analysis.
  • ...and 9 more figures
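
The effect described for Figure 1 (c) and (d), where smoothing moves a massive activation outlier into the weights, can be reproduced on toy data. The sketch below assumes the standard SmoothQuant scaling rule, s_j = max|X_j|^α / max|W_j|^(1-α) with α = 0.5; the tensor sizes, the single planted outlier of 1400, and the helper name `smooth` are illustrative assumptions, not values or code taken from the paper.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """SmoothQuant-style per-channel scaling: X -> X / s, W -> s * W (row-wise).
    The product X @ W is unchanged because the scales cancel."""
    act_max = np.abs(X).max(axis=0)   # peak per input channel, shape (C,)
    w_max = np.abs(W).max(axis=1)     # peak per weight row, shape (C,)
    s = act_max**alpha / w_max**(1.0 - alpha)
    return X / s, W * s[:, None], s

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))
X[2, 5] = 1400.0                      # hypothetical massive outlier: one token, one channel
W = rng.standard_normal((8, 4))

X_s, W_s, s = smooth(X, W)
print("output preserved:", np.allclose(X @ W, X_s @ W_s))
print("peak |X| before/after:", np.abs(X).max(), np.abs(X_s).max())
print("peak |W| before/after:", np.abs(W).max(), np.abs(W_s).max())
```

With α = 0.5 the smoothed activation channel and the matching weight row both end up with peak sqrt(max|X_j| · max|W_j|), so a single extreme token leaves the activation channel still dominated by that token while inflating an entire weight row. That is one way to read the "new outliers" in the weights that the caption mentions.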

Theorems & Definitions (6)

  • Theorem 1: Rotation
  • Theorem 2: Zigzag Permutation
  • Theorem 1: Rotation
  • proof
  • Theorem 2: Zigzag Permutation
  • proof