MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration

Jinguang Wang, Jingyu Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi, Jianxin Liao

TL;DR

MergeQuant proposes a per-channel static 4-bit quantization framework for LLMs. Its Quantization Step Migration (QSM) folds the activation quantization steps into adjacent scalings and linear mappings, removing the per-token, per-step quantization overhead of autoregressive generation. It combines dimension reconstruction and adaptive clipping to normalize channel scales and redistribute channel variation to subsequent modules, plus a LoRA-style compensation to recover accuracy. Empirical results on Llama-2 and Llama-3 models show that MergeQuant narrows the gap to FP16 to about 1.3 points on Llama-2-70B and delivers speedups in prefill, decoding, and end-to-end inference on RTX 3090 (up to 1.77x in decoding and 2.06x end-to-end on Llama-2-7B), with notable memory savings. The approach avoids full retraining, enabling practical, low-cost deployment of 4-bit static quantized LLMs.

Abstract

Quantization is widely used to compress large language models (LLMs) and accelerate their inference. Existing methods rely on per-token dynamic calibration to preserve both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerable. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overhead before and after matrix multiplication. Furthermore, because channel ranges differ significantly, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and to redistribute the channel variations to subsequent modules, balancing the parameter distribution under QSM. In the W4A4 static quantization setting, MergeQuant reduces the accuracy gap to the FP16 baseline on zero-shot tasks to 1.3 points on the Llama-2-70B model. On the Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding and up to 2.06x end-to-end speedup over the FP16 baseline.
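
The idea of merging per-channel quantization steps into neighbouring operations can be illustrated with a small sketch. It is not the authors' implementation: it assumes the activation comes from an RMSNorm layer and feeds a single linear layer, keeps the weights in floating point for readability, and uses hypothetical names (`reference`, `merged`, `steps`). The division by each channel's step is folded into the norm gain, and the multiplication by the step (dequantization) is folded into the matching weight row, so only rounding and clipping remain at run time.

```python
import math

def rmsnorm(x, gamma, eps=1e-6):
    # RMSNorm over one token vector with per-channel gain gamma.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

def quantize(values, steps, bits=4):
    # Symmetric per-channel quantization to signed integers.
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / s))) for v, s in zip(values, steps)]

def matvec(x, W):
    # x has d_in entries; W has d_in rows and d_out columns.
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def reference(x, gamma, steps, W):
    # Explicit path: normalize, quantize, dequantize, then multiply.
    y = rmsnorm(x, gamma)
    q = quantize(y, steps)
    deq = [qi * s for qi, s in zip(q, steps)]
    return matvec(deq, W)

def merged(x, gamma, steps, W):
    # Folded path (assumed form of the migration): steps are absorbed offline.
    gamma_m = [g / s for g, s in zip(gamma, steps)]           # quantization step -> norm gain
    W_m = [[s * w for w in row] for s, row in zip(steps, W)]  # dequantization step -> weight rows
    y = rmsnorm(x, gamma_m)
    q = quantize(y, [1.0] * len(y))                           # only round and clip at run time
    return matvec(q, W_m)

# Illustrative values only; static steps would come from calibration data.
x = [0.5, -1.2, 3.0, 0.1]
gamma = [1.0, 0.8, 1.0, 1.2]
steps = [0.1, 0.2, 0.5, 0.05]
W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6], [0.1, 0.7]]

out_ref = reference(x, gamma, steps, W)
out_merged = merged(x, gamma, steps, W)
print(out_ref, out_merged)
```

Both paths produce the same output up to floating-point rounding. The folded path also shows why widely varying channel steps matter: each step rescales an entire weight row, which is the parameter imbalance that the dimensional reconstruction and adaptive clipping described above are meant to address.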

Paper Structure

This paper contains 19 sections, 7 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: We evaluate the accuracy of three different calibrations and show the results when combined with a rotation-based method in Llama models, measured on PIQA.
  • Figure 2: The overall framework of MergeQuant. (a,b): the integration of MergeQuant with a Llama layer, where quantization and dequantization migrations are represented in blue, dimension reconstruction and adaptive clipping are represented in yellow; (c): The influence of "DeQuant" step overlap on weight before and after dimension reconstruction; (d): necessary notations of MergeQuant.
  • Figure 3: Decoding speedup and end-to-end speedup for the Llama-2-7B model across various batch sizes. We prefill 2048 tokens and decode 256 tokens. The experiments are performed on RTX 3090 GPUs.
  • Figure 4: The quantization process of Dynamic Quantization (a) and our MergeQuant (b).
  • Figure 5: Visualization of maximum absolute values in linear layers of Llama-2-7B.
  • ...and 2 more figures