A Comprehensive Evaluation on Quantization Techniques for Large Language Models

Yutong Liu, Cairong Zhao, Guosheng Hu

TL;DR

Post-training quantization (PTQ) can dramatically reduce memory and compute for large language models, but achieving lossless W4A4 (4-bit weights and activations) remains challenging due to outliers and distribution skew. The authors decouple published methods into a two-step framework that separates pre-quantization transformation (shifting, scaling, rotation) from quantization error mitigation (self-compensation and low-rank compensation), and they evaluate these components under a unified configuration, including the FP4 formats MXFP4 and NVFP4. Their results show that optimized rotation and scaling give the best pre-quantization performance and that adding low-rank compensation to GPTQ can occasionally outperform GPTQ alone; finer granularity improves accuracy at the cost of extra storage, and the format and precision of FP4 scaling factors strongly affect results. The work provides practical guidelines for FP4/INT4 quantization of LLMs and contributes an open framework for fair benchmarking across quantization settings.

Abstract

For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often evaluated under different settings because a method typically contains multiple components. Analyzing connections among existing methods is important for deeper understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. To our knowledge, such a fair and extensive investigation remains critically underexplored. To better understand connections, first, we decouple published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. The former is a preprocessing step that reduces outlier impact by flattening the data distribution; the latter offsets quantization errors to improve performance. Second, we evaluate and analyze the impact of different settings, including granularity and symmetry. Third, we analyze and evaluate the latest MXFP4 and NVFP4 data formats and their performance. Our experiments first demonstrate that optimized rotation and scaling yield the best pre-quantization performance, and that combining low-rank compensation with GPTQ can occasionally outperform GPTQ alone for error mitigation. Second, finer granularity improves performance but increases storage overhead. Third, we find that scaling-factor format and precision greatly affect FP4 performance, and that rotation-based strategies effective for INT4 offer limited gains for MXFP4 and NVFP4, motivating further study.
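
As a rough illustration of the pre-quantization transformation step described above, the sketch below applies a fixed orthonormal Hadamard rotation to a toy activation matrix with one outlier channel and compares the W4A4 matmul error with and without the rotation. Everything here is an assumption made for illustration: the synthetic data, the helper names (`hadamard`, `fake_quant_int4`, `w4a4_rel_error`), and the use of a fixed Hadamard matrix, which stands in for the optimized rotations the paper evaluates and is not the authors' implementation.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def fake_quant_int4(x):
    """Symmetric INT4 quantize-dequantize with one absmax scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(x / scale), -8, 7) * scale

def w4a4_rel_error(x, w):
    """Relative Frobenius error of a W4A4 matmul against the full-precision product."""
    xq = fake_quant_int4(x)        # one scale per token (row of x)
    wq = fake_quant_int4(w.T).T    # one scale per output channel (column of w)
    ref = x @ w
    return np.linalg.norm(xq @ wq - ref) / np.linalg.norm(ref)

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal((64, d))
x[:, 5] *= 50.0                    # synthetic outlier channel (toy assumption)
w = rng.standard_normal((d, d)) / np.sqrt(d)

q = hadamard(d)
x_rot, w_rot = x @ q, q.T @ w      # (x q)(q^T w) = x w because q q^T = I

err_plain = w4a4_rel_error(x, w)
err_rot = w4a4_rel_error(x_rot, w_rot)
print(f"relative error without rotation: {err_plain:.3f}")
print(f"relative error with rotation:    {err_rot:.3f}")
```

The rotation leaves the full-precision product unchanged while spreading the outlier's energy across all channels, so the per-token scale is no longer dominated by a single value. This toy setup only shows the mechanism; it says nothing about how large the gain is on real models.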

Paper Structure

This paper contains 19 sections, 6 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Pre-quantization transformation: a process that transforms activations and weights to make them easier to quantize. The left graph illustrates a tensor that is difficult to quantize due to the presence of numerous outliers, while the right graph shows that after the pre-quantization transformation, the tensor becomes more uniformly distributed, making quantization easier.
  • Figure 2: Quantization error mitigation: compensating for the error produced by quantization. Here, $dq(W_q)$ denotes dequantizing $W_q$ back to the original-precision value space for computation.
  • Figure 3: Symmetry of quantization
  • Figure 4: Granularity of quantization: (1) Per-tensor quantization; (2) Per-channel quantization; (3) Per-group quantization.
  • Figure 5: Weight distribution of a q_proj layer of Llama-3.2-1B. (a): Weight distribution in full precision (BF16). (b): Weight distribution after INT4 quantization (group size = 32), compared with MXFP4. (c): Weight distribution after MXFP4 quantization.
  • ...and 1 more figure
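
The symmetry and granularity settings named in Figures 3 and 4 can be made concrete with a small NumPy sketch. It quantizes a random weight matrix to INT4 at per-tensor, per-channel, and per-group granularity, and then compares symmetric and asymmetric quantization on a skewed (non-negative) tensor. The helper `fake_quant`, the matrix shape, the group size of 32, and the Gaussian test data are all assumptions chosen for illustration, not the paper's configuration.

```python
import numpy as np

def fake_quant(w, block, symmetric=True, bits=4):
    """Quantize-dequantize w with one scale (and zero point) per block.

    block=(rows, cols) is per-tensor, (1, cols) is per-channel (one row
    per output channel), and (1, g) is per-group with group size g.
    Returns the dequantized matrix and the number of scales stored.
    """
    rows, cols = w.shape
    br, bc = block
    blocks = w.reshape(rows // br, br, cols // bc, bc)
    axes = (1, 3)  # axes that index elements inside one block
    if symmetric:
        qmax = 2 ** (bits - 1) - 1                  # 7 for INT4
        scale = np.abs(blocks).max(axis=axes, keepdims=True) / qmax
        scale = np.where(scale == 0, 1.0, scale)
        q = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
        deq = q * scale
    else:
        levels = 2 ** bits - 1                      # 15 for INT4
        lo = blocks.min(axis=axes, keepdims=True)
        hi = blocks.max(axis=axes, keepdims=True)
        scale = (hi - lo) / levels
        scale = np.where(scale == 0, 1.0, scale)
        zero = np.round(-lo / scale)                # integer zero point
        q = np.clip(np.round(blocks / scale) + zero, 0, levels)
        deq = (q - zero) * scale
    return deq.reshape(rows, cols), scale.size

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256))

results = {}
for name, block in [("per-tensor", (64, 256)),
                    ("per-channel", (1, 256)),
                    ("per-group", (1, 32))]:
    deq, n_scales = fake_quant(w, block)
    results[name] = (mse(w, deq), n_scales)
    print(f"{name:12s} scales stored: {n_scales:4d}  MSE: {results[name][0]:.5f}")

# Skewed, non-negative data: symmetric quantization wastes the negative codes.
skewed = np.abs(w)
mse_sym = mse(skewed, fake_quant(skewed, (1, 32), symmetric=True)[0])
mse_asym = mse(skewed, fake_quant(skewed, (1, 32), symmetric=False)[0])
print(f"skewed data  symmetric MSE: {mse_sym:.5f}  asymmetric MSE: {mse_asym:.5f}")
```

On this toy data, finer granularity lowers the reconstruction error while multiplying the number of scales that must be stored, which is the accuracy versus storage trade-off the abstract describes. The asymmetric variant additionally stores a zero point per block, which the count above does not include.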