A Comprehensive Evaluation on Quantization Techniques for Large Language Models
Yutong Liu, Cairong Zhao, Guosheng Hu
TL;DR
Post-training quantization (PTQ) can substantially reduce the memory and compute cost of large language models, but lossless W4A4 (4-bit weights and 4-bit activations) remains difficult because of outliers and skewed distributions. The authors propose a two-step quantization framework that separates pre-quantization transformation (shifting, scaling, rotation) from quantization error mitigation (self-compensation and low-rank compensation), and they evaluate these components under a unified configuration, including the FP4 formats MXFP4 and NVFP4. Their results show that optimized rotation and scaling give the best pre-quantization performance, and that combining low-rank compensation with GPTQ can occasionally outperform GPTQ alone. Finer granularity improves accuracy at the cost of more storage. For FP4, the format and precision of the scaling factor play a crucial role, while rotation-based strategies that work for INT4 offer limited gains for MXFP4 and NVFP4. The work provides practical guidelines for FP4/INT4 quantization of LLMs and contributes an open framework for fair benchmarking across quantization settings.
Abstract
For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is evolving rapidly, and many papers report breakthrough results. However, because a method typically combines multiple components, these results are often obtained under different settings, which makes them hard to compare. Analyzing the connections among existing methods is also important for deeper understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. To our knowledge, such a fair and extensive investigation remains critically underexplored. First, to better understand these connections, we decouple published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. The former is a preprocessing step that reduces the impact of outliers by flattening the data distribution; the latter offsets quantization errors to improve performance. Second, we evaluate and analyze the impact of different quantization settings, including granularity and symmetry. Third, we analyze and evaluate the latest MXFP4 and NVFP4 data formats. Our experiments yield three findings. First, optimized rotation and scaling yield the best pre-quantization performance, and combining low-rank compensation with GPTQ can occasionally outperform GPTQ alone for error mitigation. Second, finer granularity improves performance but increases storage overhead. Third, the format and precision of the scaling factor greatly affect FP4 performance, and rotation-based strategies that are effective for INT4 offer limited gains for MXFP4 and NVFP4, motivating further study.
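As a rough illustration of two ideas named in the abstract, granularity and pre-quantization rotation, the following toy sketch fake-quantizes a matrix with one outlier channel to symmetric INT4. It is not the authors' framework: the helper names (quantize_int4_symmetric, hadamard, mse), the synthetic data, the max/7 scale rule, and the use of a fixed Hadamard matrix in place of an optimized rotation are all assumptions made for this example.

```python
import numpy as np

def quantize_int4_symmetric(x, group_size=None):
    """Fake-quantize a 2-D array to symmetric INT4, then dequantize.

    group_size=None uses one scale per row (coarse granularity);
    otherwise one scale per contiguous group of `group_size` values
    along the last axis (finer granularity, more scales to store).
    """
    rows, cols = x.shape
    g = cols if group_size is None else group_size
    assert cols % g == 0, "group_size must divide the row length"
    xg = x.reshape(rows, cols // g, g)
    # Assumed convention: map the largest magnitude in a group to 7.
    scale = np.abs(xg).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(xg / scale), -8, 7)
    return (q * scale).reshape(rows, cols)

def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction, n a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Synthetic activations with a single outlier channel (hypothetical data).
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 128))
x[:, 5] *= 50.0

# Granularity: per-row scale vs. per-group scale (32 values per group).
err_row = mse(x, quantize_int4_symmetric(x))
err_group = mse(x, quantize_int4_symmetric(x, group_size=32))

# Rotation: spread the outlier across channels, quantize, rotate back.
H = hadamard(128)
x_rot = x @ H
err_rot = mse(x, quantize_int4_symmetric(x_rot) @ H.T)

print(f"per-row: {err_row:.4f}  per-group: {err_group:.4f}  rotated: {err_rot:.4f}")
```

In this toy setting, both the finer granularity and the rotation lower the reconstruction error relative to plain per-row quantization, because the outlier no longer dictates the scale for every value in the row. The per-group variant stores four scales per row instead of one, which is the storage overhead the abstract refers to.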
