First-Order Error Matters: Accurate Compensation for Quantized Large Language Models
Xingyu Zheng, Haotong Qin, Yuye Li, Haoran Chu, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu
TL;DR
FOEM improves post-training quantization (PTQ) of large language models by accounting for the first-order term of the quantization error, which existing compensation-based methods neglect. It introduces a practical gradient-aware correction based on a Taylor expansion around the pre-quantization weights, and exploits Cholesky-based inverses to avoid explicit Hessian computations, yielding a Hessian-free, low-overhead solution. Empirically, FOEM consistently surpasses GPTQ across a range of models and configurations, including 3-bit weight-only quantization and W4A4KV4 with SpinQuant, which indicates that it generalizes across models. The result is more accurate PTQ for resource-constrained deployments without a significant increase in computational cost.
Abstract
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by performing a first-order Taylor expansion around the pre-quantization weights. This yields an approximation based on the difference between latent and full-precision weights as well as the Hessian matrix. When substituted into the theoretical solution, the formulation eliminates the need to explicitly compute the Hessian, thereby avoiding the high computational cost and limited generalization of backpropagation-based gradient methods. This design introduces only minimal additional computational overhead. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 17.3% and increases the 5-shot MMLU accuracy from 53.8% achieved by GPTAQ to 56.1%. Moreover, FOEM can be seamlessly combined with advanced techniques such as SpinQuant, delivering additional gains under the challenging W4A4KV4 setting and further narrowing the performance gap with full-precision baselines, surpassing existing state-of-the-art methods.
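The substitution described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation. It assumes a single weight row, a layer-wise Hessian proxy H built from random calibration inputs, and the gradient approximation g ≈ H (w_latent - w_fp). Under those assumptions, minimizing g^T d + 0.5 d^T H d subject to d[q] = c (the change forced by quantizing coordinate q) has a closed form in which H^{-1} g collapses to the weight difference, so only the inverse Hessian that GPTQ-style methods already maintain is needed. All function and variable names below are hypothetical.

```python
import numpy as np

def first_order_aware_update(H_inv, delta, q, c):
    """Closed-form minimizer of g^T d + 0.5 d^T H d subject to d[q] = c,
    under the assumed approximation g = H @ delta, so that H_inv @ g = delta.
    Only H_inv and delta are used; neither g nor H appears explicitly."""
    return -delta + (c + delta[q]) / H_inv[q, q] * H_inv[:, q]

def reference_update(H, g, q, c):
    """Same constrained problem, solved from the KKT system with an explicit
    gradient g. Used here only to check the closed form."""
    n = H.shape[0]
    e = np.zeros(n)
    e[q] = 1.0
    # Stationarity: H d + lam * e = -g ; constraint: e^T d = c
    kkt = np.block([[H, e[:, None]], [e[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([-g, [c]])
    return np.linalg.solve(kkt, rhs)[:n]

rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, 32))            # hypothetical calibration inputs
H = X @ X.T + 1e-2 * np.eye(n)          # damped Hessian proxy (assumption)
H_inv = np.linalg.inv(H)
delta = 0.05 * rng.normal(size=n)       # latent minus full-precision weights
q, c = 2, 0.1                           # quantized index and its forced change

d_fast = first_order_aware_update(H_inv, delta, q, c)
d_ref = reference_update(H, H @ delta, q, c)
```

Setting delta to zero recovers the familiar second-order-only update c / H_inv[q, q] * H_inv[:, q], which makes the extra first-order contribution easy to isolate. The sketch ignores details such as restricting the update to not-yet-quantized weights and updating the inverse via Cholesky factors.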
