BiSup: Bidirectional Quantization Error Suppression for Large Language Models

Minghui Zou, Ronghui Guo, Sai Zhang, Xiaowang Zhang, Zhiyong Feng

TL;DR

BiSup tackles the challenge of bidirectional quantization error in weight-activation quantization for large language models by introducing a unified framework that combines quantization-aware parameter-efficient fine-tuning with a prompt-based mixed-precision strategy. Its components include fine-grained clipping, soft-constrained smoothing, stabilized low-rank error compensation, and a trainable, PEFT-based optimization over carefully designed parameter spaces, alongside maintaining high-precision system prompts to preserve important token interactions. Empirically, BiSup yields consistent improvements on Llama and Qwen families, notably reducing WikiText2 perplexity under challenging configurations (e.g., from 13.26 to 9.41 for Atom and from 14.33 to 7.85 for QuaRot under W3A3-g128), and ablation studies confirm the effectiveness of each module. The work highlights the practical potential of combining minimal data-dependent fine-tuning with selective high-precision prompts to enable robust, low-bit quantization for deployment of large-scale LLMs in resource-constrained settings.

Abstract

As the size and context length of Large Language Models (LLMs) grow, weight-activation quantization has emerged as a crucial technique for efficient deployment of LLMs. Compared to weight-only quantization, weight-activation quantization presents greater challenges due to the presence of outliers in activations. Existing methods have made significant progress by exploring mixed-precision quantization and outlier suppression. However, these methods primarily focus on optimizing the result of a single matrix multiplication, neglecting the bidirectional propagation of quantization errors in LLMs. Specifically, errors accumulate vertically within the same token through layers, and diffuse horizontally across different tokens due to self-attention mechanisms. To address this issue, we introduce BiSup, a Bidirectional quantization error Suppression method. By constructing appropriate optimizable parameter spaces, BiSup utilizes a small amount of data for quantization-aware parameter-efficient fine-tuning to suppress the vertical accumulation of errors. In addition, BiSup employs a prompt mixed-precision quantization strategy, which preserves high precision for the key-value cache of system prompts, to mitigate the horizontal diffusion of errors. Extensive experiments on Llama and Qwen families demonstrate that BiSup can improve performance over two state-of-the-art methods (the average WikiText2 perplexity decreases from 13.26 to 9.41 for Atom and from 14.33 to 7.85 for QuaRot under the W3A3-g128 configuration), further facilitating the practical applications of low-bit weight-activation quantization.
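
To make the setting concrete, the following is a minimal sketch of group-wise symmetric round-to-nearest quantization, not the BiSup, Atom, or QuaRot implementation. It reads W3A3-g128 as 3-bit weights and 3-bit activations with a group size of 128, which is an assumption about the notation. The function name `fake_quantize`, the Gaussian test data, and the outlier value 50.0 are all hypothetical choices made for illustration.

```python
import random


def fake_quantize(values, bits, group_size=128):
    """Quantize each group to signed `bits`-bit integers, then dequantize.

    Illustrative only; assumes bits >= 2.
    """
    qmax = 2 ** (bits - 1) - 1  # 3 for 3-bit, 127 for 8-bit
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max(abs(v) for v in group) / qmax or 1.0
        for v in group:
            q = max(-qmax - 1, min(qmax, round(v / scale)))
            out.append(q * scale)
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(256)]  # two groups of 128

# Fewer bits means a coarser grid and a larger error.
err_a8 = mse(activations, fake_quantize(activations, bits=8))
err_a3 = mse(activations, fake_quantize(activations, bits=3))

# A single outlier inflates the scale of its whole group, so the
# remaining values in that group are represented much more coarsely.
with_outlier = list(activations)
with_outlier[0] = 50.0  # hypothetical activation outlier
quantized = fake_quantize(with_outlier, bits=3)
err_outlier_group = mse(with_outlier[1:128], quantized[1:128])
err_clean_group = mse(with_outlier[128:], quantized[128:])
```

Comparing `err_outlier_group` with `err_clean_group` shows why activation outliers make weight-activation quantization harder than weight-only quantization: the group containing the outlier shares one scale, so its ordinary values lose most of their resolution.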

Paper Structure (29 sections, 8 equations, 5 figures, 17 tables, 1 algorithm)

This paper contains 29 sections, 8 equations, 5 figures, 17 tables, 1 algorithm.

Figures (5)

  • Figure 1: The error propagation within an attention block under activation-only quantization. Tensor color encodes error status: orange marks tensors that contain quantization error, and white marks tensors that do not.
  • Figure 2: (a) and (b) show the attention maps of Llama3-8B and Llama3-8B-Instruct, respectively, and (c) illustrates the flow of the prompt mixed-precision quantization strategy. The symbols in (c) are explained as follows: SP, UP, and NT represent the system prompt, user prompt, and newly generated next token, respectively. These three elements together constitute the input to the LLM. FP LLM refers to the original LLM, while INT LLM denotes the quantized counterpart. Step ① indicates encoding and storing the system prompt into the KV cache at high precision. Step ② involves the model quantization. Step ③ describes the inference of the user prompt, which includes interactions with the mixed-precision KV cache. Step ④ denotes the prediction of the next token.
  • Figure 3: The mean square error (MSE$\downarrow$) of the activation of the last decoder layer in Llama3-8B. The dataset used for visualization is WikiText2, where Calib is sampled from the training set (and used to calibrate the quantized model) and Eval is sampled from the test set. Explanation of notation: Atom (Calib) denotes the result of the Atom method on the Calib dataset of WikiText2. Note that due to algorithmic differences, Atom (or Atom_BiSup) and QuaRot (or QuaRot_BiSup) do not produce the same activations, so MSE comparisons between these two types of methods are not meaningful.
  • Figure A1: Loss curves on the first layer of Llama3-8B under different settings. (S)LREC denotes (Stabilized) Low-Rank Error Compensation.
  • Figure A2: Loss curves on the first layer of Qwen1.5-7B under different settings. (S)LREC denotes (Stabilized) Low-Rank Error Compensation.
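
The prompt mixed-precision flow described in the caption of Figure 2(c) can be illustrated with a toy single-head attention over a key-value cache. This is a simplified sketch under stated assumptions, not the authors' implementation: it models only the storage precision of the cache (system-prompt entries kept as-is, all other entries fake-quantized per token), whereas the caption describes encoding the system prompt with the original FP LLM. The helper names `fake_quantize_vector`, `build_kv_cache`, and `attend`, the random data, and the 3-bit setting are hypothetical.

```python
import math
import random


def fake_quantize_vector(vec, bits):
    """Symmetric per-vector round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vec) / qmax or 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in vec]


def build_kv_cache(keys, values, num_system_tokens, bits):
    """Keep the first `num_system_tokens` entries at full precision; quantize the rest."""
    cache = []
    for i, (k, v) in enumerate(zip(keys, values)):
        if i < num_system_tokens:
            cache.append((list(k), list(v)))  # system prompt: high precision
        else:
            cache.append((fake_quantize_vector(k, bits),
                          fake_quantize_vector(v, bits)))
    return cache


def attend(query, cache):
    """Scaled dot-product attention of one query over the cached keys and values."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key, _ in cache]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    out = [0.0] * dim
    for w, (_, value) in zip(weights, cache):
        for j in range(dim):
            out[j] += (w / total) * value[j]
    return out


random.seed(0)
dim, num_system_tokens, num_tokens = 16, 3, 8
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
query = [random.gauss(0, 1) for _ in range(dim)]

# Full-precision reference: every token is treated as "system prompt".
reference = attend(query, build_kv_cache(keys, values, num_tokens, bits=3))
# Mixed precision: only the first three tokens stay at full precision.
mixed_cache = build_kv_cache(keys, values, num_system_tokens, bits=3)
mixed = attend(query, mixed_cache)
# Baseline: the whole cache is quantized.
all_quantized = attend(query, build_kv_cache(keys, values, 0, bits=3))
```

Comparing `mixed` and `all_quantized` against `reference` gives a rough feel for how much of the attention output error comes from the cached system-prompt entries. How large that share is depends on how much attention the query places on those tokens, which this random toy data does not model.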