ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization
Weibo Zhao, Yubin Shi, Xinyu Lyu, Wanchen Sui, Shen Li, Yong Li
TL;DR
ASER addresses the degradation of large language models under low-bit post-training quantization by revealing a low-rank structure in the activation-weight quantization error and targeting it with two complementary techniques. The method combines Error Reconstruction via whitening SVD using LoRA-style matrices and Activation Smoothing to handle activation outliers, enabling effective compensation with a small set of parameters. Across LLaMA3-8B, Qwen1.5-7B, and Qwen-72B, ASER delivers perplexity and accuracy close to FP16, often outperforming state-of-the-art PTQ baselines, with negligible overhead. This work provides a practical, data-efficient pathway for robust activation quantization in large transformers, expanding the viability of low-bit quantization for deployment at scale.
Abstract
Quantization stands as a pivotal technique for large language model (LLM) serving, yet achieving effective low-bit quantization remains a significant challenge. The limited numerical mapping causes the quantized model to produce non-trivial error, leading to intolerable performance degradation. Anchored in the basic objective of model compression, this paper examines the layer-wise error distribution of LLMs during post-training quantization. We then introduce ASER, an algorithm consisting of (1) Error Reconstruction: low-rank compensation for quantization error with LoRA-style matrices constructed by whitening SVD; (2) Activation Smoothing: outlier extraction to obtain smooth activations and better error compensation. ASER can quantize typical LLMs to low-bit ones, preserving accuracy even in the W4A8 per-channel setup. Experimental results show that ASER is competitive with state-of-the-art quantization algorithms, showing particular potential for activation quantization, with minor overhead.
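To make the Error Reconstruction idea concrete, the following is a minimal NumPy sketch of low-rank compensation via a whitened SVD. It is an illustration under stated assumptions, not the authors' implementation: the whitening matrix is assumed to be the Cholesky factor of the calibration Gram matrix, only the weight is quantized (activation quantization and the Activation Smoothing step are omitted), and the names `quantize_per_channel`, `whitened_low_rank_error`, `L_A`, and `L_B` are hypothetical.

```python
import numpy as np


def quantize_per_channel(W, bits=4):
    """Symmetric round-to-nearest quantization, one scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale


def whitened_low_rank_error(W, X, rank, bits=4, eps=1e-6):
    """Approximate the quantization error E = W - Wq by L_A @ L_B.

    W: (out, in) weight; X: (in, n_samples) calibration activations.
    Assumption: whitening uses S with S @ S.T = X @ X.T (+ eps * I), so that
    truncating the SVD of E @ S minimizes the output error ||(E - L_A L_B) X||_F.
    """
    Wq = quantize_per_channel(W, bits)
    E = W - Wq
    gram = X @ X.T + eps * np.eye(X.shape[0])  # eps keeps the Gram matrix positive definite
    S = np.linalg.cholesky(gram)               # lower-triangular whitening factor
    U, s, Vt = np.linalg.svd(E @ S, full_matrices=False)
    L_A = U[:, :rank] * s[:rank]                # (out, rank): U_r * Sigma_r
    L_B = np.linalg.solve(S.T, Vt[:rank].T).T   # (rank, in):  V_r^T @ S^{-1}
    return Wq, L_A, L_B


# Synthetic example: a few activation channels carry large outliers.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
X = rng.standard_normal((128, 512))
X[:4] *= 30.0

Wq, L_A, L_B = whitened_low_rank_error(W, X, rank=8)
err_plain = np.linalg.norm((W - Wq) @ X)              # output error without compensation
err_comp = np.linalg.norm((W - Wq - L_A @ L_B) @ X)   # output error with low-rank compensation
print(err_plain, err_comp)
```

At inference time the layer output would be computed as `Wq @ x + L_A @ (L_B @ x)`, so the extra cost scales with the chosen rank rather than with the full weight size. Whitening matters because it weights the error by how strongly each input channel is actually activated, which a plain SVD of the error matrix ignores.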
