ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization

Weibo Zhao, Yubin Shi, Xinyu Lyu, Wanchen Sui, Shen Li, Yong Li

TL;DR

ASER addresses the degradation of large language models under low-bit post-training quantization by revealing a low-rank structure in the activation-weight quantization error and targeting it with two complementary techniques. The method combines Error Reconstruction via whitening SVD using LoRA-style matrices and Activation Smoothing to handle activation outliers, enabling effective compensation with a small set of parameters. Across LLaMA3-8B, Qwen1.5-7B, and Qwen-72B, ASER delivers perplexity and accuracy close to FP16, often outperforming state-of-the-art PTQ baselines, with negligible overhead. This work provides a practical, data-efficient pathway for robust activation quantization in large transformers, expanding the viability of low-bit quantization for deployment at scale.

Abstract

Quantization is a pivotal technique for large language model (LLM) serving, yet effective low-bit quantization remains challenging. The limited numerical mapping causes the quantized model to produce a non-trivial error, resulting in intolerable performance degradation. This paper is anchored in the basic objective of model compression and examines the layer-wise error distribution of LLMs during post-training quantization. We then introduce ASER, an algorithm consisting of (1) Error Reconstruction: low-rank compensation for quantization error with LoRA-style matrices constructed by whitening SVD; (2) Activation Smoothing: outlier extraction to obtain smooth activations and better error compensation. ASER can quantize typical LLMs to low-bit versions while preserving accuracy, even in the W4A8 per-channel setup. Experimental results show that ASER is competitive with state-of-the-art quantization algorithms, showing potential for activation quantization with minor overhead.
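The Error Reconstruction idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a weight of shape (out, in), calibration activations of shape (in, tokens), a simple symmetric round-to-nearest quantizer, and a Cholesky factor of the activation Gram matrix as the whitening transform. The function names (`rtn_quantize`, `whitened_low_rank`, `plain_low_rank`, `compare_errors`), the ridge term `eps`, and the synthetic data are all hypothetical choices for illustration. Activation quantization and Activation Smoothing are left out.

```python
import numpy as np


def rtn_quantize(w, bits=4):
    """Symmetric per-output-channel round-to-nearest; returns dequantized weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def whitened_low_rank(err, x, rank, eps=1e-6):
    """Rank-`rank` factors (l_a, l_b) that approximately minimize
    ||(err - l_a @ l_b) @ x||_F (approximately, because of the ridge)."""
    gram = x @ x.T
    # Small ridge (an assumption of this sketch) keeps the Cholesky well-posed.
    gram = gram + eps * np.trace(gram) / gram.shape[0] * np.eye(gram.shape[0])
    s = np.linalg.cholesky(gram)                # lower triangular, s @ s.T == gram
    u, sigma, vt = np.linalg.svd(err @ s, full_matrices=False)
    l_a = u[:, :rank] * sigma[:rank]            # shape (out, rank)
    l_b = np.linalg.solve(s.T, vt[:rank].T).T   # shape (rank, in), vt[:rank] @ inv(s)
    return l_a, l_b


def plain_low_rank(err, rank):
    """Baseline: truncated SVD of the weight error, ignoring activations."""
    u, sigma, vt = np.linalg.svd(err, full_matrices=False)
    return u[:, :rank] * sigma[:rank], vt[:rank]


def compare_errors(seed=0, d_out=32, d_in=48, n_tokens=256, rank=8):
    """Output error with no compensation, plain SVD, and whitened SVD."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((d_out, d_in))
    x = rng.standard_normal((d_in, n_tokens))
    x[:4] *= 20.0                               # a few synthetic outlier channels
    err = w - rtn_quantize(w)                   # weight quantization error

    def output_error(l_a, l_b):
        return np.linalg.norm((err - l_a @ l_b) @ x)

    none = np.linalg.norm(err @ x)
    plain = output_error(*plain_low_rank(err, rank))
    white = output_error(*whitened_low_rank(err, x, rank))
    return none, plain, white
```

Calling `compare_errors()` returns the three output-error norms. At inference the two factors would be applied as a LoRA-style side branch, `w_q @ x + l_a @ (l_b @ x)`. Whitening makes the truncation account for how strongly each input channel is activated, so the retained rank goes to the directions that matter most for the layer output rather than for the raw weight error.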

Paper Structure

This paper contains 14 sections, 13 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: The framework of ASER. The LoRA-style matrices $\textbf{L}_A, \textbf{L}_B$ are generated to reconstruct the quantization error.
  • Figure 2: The largest $128$ singular value distribution of quantization error in the $30^\text{th}$ layer of LLaMA3-8B by RTN quantization. The singular values are normalized for comparison.
  • Figure 3: The effective rank of $\textbf{E}_q\textbf{X}$ across layers in LLaMA3-8B by RTN quantization.
  • Figure 4: Magnitude of activation-weight quantization error $\textbf{E}_q\textbf{X}$, mean activation $\bar{\textbf{X}}$, mean weight $\bar{\textbf{W}}$ and their product $\bar{\textbf{X}}\bar{\textbf{W}}$. The channel index is sorted by $\bar{\textbf{X}}\bar{\textbf{W}}$. Data comes from the largest $512$ channels of a linear layer in LLaMA3-8B when conducting RTN quantization.
  • Figure 5: Perplexity of quantized Qwen1.5-7B with int8 weight and different quantization bit of activation, i.e., W8Ax.
  • ...and 1 more figures