BiSup: Bidirectional Quantization Error Suppression for Large Language Models
Minghui Zou, Ronghui Guo, Sai Zhang, Xiaowang Zhang, Zhiyong Feng
TL;DR
BiSup tackles the challenge of bidirectional quantization error in weight-activation quantization for large language models by introducing a unified framework that combines quantization-aware parameter-efficient fine-tuning (PEFT) with a prompt-based mixed-precision strategy. Its components include fine-grained clipping, soft-constrained smoothing, stabilized low-rank error compensation, and PEFT-based optimization over carefully designed parameter spaces, together with keeping system prompts in high precision to preserve important token interactions. Empirically, BiSup yields consistent improvements on the Llama and Qwen families, notably reducing average WikiText2 perplexity under challenging configurations (e.g., from 13.26 to 9.41 for Atom and from 14.33 to 7.85 for QuaRot under W3A3-g128), and ablation studies confirm the effectiveness of each module. The work highlights the practical potential of combining minimal data-dependent fine-tuning with selective high-precision prompts to enable robust, low-bit quantization for deploying large-scale LLMs in resource-constrained settings.
Abstract
As the size and context length of Large Language Models (LLMs) grow, weight-activation quantization has emerged as a crucial technique for efficient deployment of LLMs. Compared to weight-only quantization, weight-activation quantization presents greater challenges due to the presence of outliers in activations. Existing methods have made significant progress by exploring mixed-precision quantization and outlier suppression. However, these methods primarily focus on optimizing the result of a single matrix multiplication, neglecting the bidirectional propagation of quantization errors in LLMs. Specifically, errors accumulate vertically within the same token through layers, and diffuse horizontally across different tokens due to self-attention mechanisms. To address this issue, we introduce BiSup, a Bidirectional quantization error Suppression method. By constructing appropriate optimizable parameter spaces, BiSup utilizes a small amount of data for quantization-aware parameter-efficient fine-tuning to suppress the vertical accumulation of errors. In addition, BiSup employs a prompt mixed-precision quantization strategy, which preserves high precision for the key-value cache of system prompts, to mitigate the horizontal diffusion of errors. Extensive experiments on the Llama and Qwen families demonstrate that BiSup can improve performance over two state-of-the-art methods (the average WikiText2 perplexity decreases from 13.26 to 9.41 for Atom and from 14.33 to 7.85 for QuaRot under the W3A3-g128 configuration), further facilitating the practical applications of low-bit weight-activation quantization.
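To make the prompt mixed-precision idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that W3A3-g128 denotes 3-bit weights and activations with a quantization group size of 128, and it uses a plain asymmetric min-max quantizer as a stand-in, since the abstract does not specify the quantizer. The function names `fake_quantize` and `quantize_kv_cache` and the parameter `n_prompt_tokens` are hypothetical and chosen for illustration only.

```python
import numpy as np


def fake_quantize(x, bits=3, group_size=128):
    """Quantize then dequantize x in groups of `group_size` values.

    Assumption: plain asymmetric min-max quantization; the paper's actual
    quantizer (clipping, smoothing, etc.) is not reproduced here.
    The total number of elements must be divisible by `group_size`.
    """
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1  # 7 steps, i.e. 8 representable values for 3 bits
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((groups - lo) / scale), 0, levels)
    return (q * scale + lo).reshape(x.shape)


def quantize_kv_cache(kv, n_prompt_tokens, bits=3, group_size=128):
    """Keep the first `n_prompt_tokens` rows (the system prompt) in full
    precision and fake-quantize the rows of all later tokens.

    kv has shape (seq_len, hidden_dim).
    """
    out = kv.copy()
    out[n_prompt_tokens:] = fake_quantize(kv[n_prompt_tokens:], bits, group_size)
    return out


rng = np.random.default_rng(0)
kv = rng.standard_normal((16, 256))  # 16 tokens, hidden size 256
mixed = quantize_kv_cache(kv, n_prompt_tokens=4)

prompt_err = np.abs(mixed[:4] - kv[:4]).max()  # exactly zero: rows untouched
rest_err = np.abs(mixed[4:] - kv[4:]).max()    # non-zero quantization error
print(prompt_err, rest_err)
```

In this sketch, later tokens that attend to the system prompt read exact key-value entries for it, which is the intuition behind limiting the horizontal diffusion of errors; the vertical accumulation through layers is addressed in the paper by fine-tuning and is not modeled here.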
