Identifying Sensitive Weights via Post-quantization Integral
Yuezhou Hu, Weiyu Huang, Zichen Liang, Chang Chen, Jintao Zhang, Jun Zhu, Jianfei Chen
TL;DR
This work addresses post-training quantization (PTQ) of large language models. It shows that traditional gradient- and Hessian-based sensitivity metrics underestimate the quantization-induced loss change $\Delta F$ by orders of magnitude, because the local Taylor approximation has a small convergence radius. It introduces Post-quantization Integral (PQI), a posterior, path-aware sensitivity estimator that integrates along the actual quantization path from $\boldsymbol{w}$ to $\tilde{\boldsymbol{w}}$ and uses both endpoints for accuracy. Building on PQI, the authors propose ReQuant, a Dense-and-Sparse decomposition pipeline with self-adaptive outlier selection and step-wise significant weight detachment, to boost PTQ performance. Experiments on Llama 3.2 1B/3B show perplexity and few-shot improvements over strong baselines, with practical gains in decoding efficiency and accuracy, demonstrating PQI’s value for robust weight quantization in large models.
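To make the contrast between a local Taylor estimate and a path-aware estimate concrete, here is a minimal sketch on a toy problem. It relies only on the line-integral identity $\Delta F = \int_0^1 \nabla F(\boldsymbol{w} + t(\tilde{\boldsymbol{w}} - \boldsymbol{w}))^\top (\tilde{\boldsymbol{w}} - \boldsymbol{w})\,dt$, approximated with a midpoint rule and kept per element. The quartic loss, the round-to-nearest quantizer, and all function names are assumptions chosen for illustration; this is not the paper's exact estimator, and no LLM is involved.

```python
# Toy, separable loss with its optimum at `center` (a stand-in for a trained model).
def loss(w, center):
    return sum((x - c) ** 4 for x, c in zip(w, center))

def grad(w, center):
    return [4 * (x - c) ** 3 for x, c in zip(w, center)]

def hess_diag(w, center):
    # The toy loss is separable, so its Hessian is diagonal.
    return [12 * (x - c) ** 2 for x, c in zip(w, center)]

def quantize(w, step):
    # Round-to-nearest uniform grid; a placeholder quantizer.
    return [round(x / step) * step for x in w]

def taylor_estimate(w, w_q, center):
    # First- plus second-order terms, evaluated at w only.
    g, h = grad(w, center), hess_diag(w, center)
    return sum(gi * (q - x) + 0.5 * hi * (q - x) ** 2
               for gi, hi, x, q in zip(g, h, w, w_q))

def path_integral(w, w_q, center, n_steps=64):
    # Midpoint-rule approximation of the line integral of the gradient
    # along the straight segment from w to w_q, kept per element.
    delta = [q - x for x, q in zip(w, w_q)]
    contrib = [0.0] * len(w)
    for k in range(n_steps):
        t = (k + 0.5) / n_steps
        point = [x + t * d for x, d in zip(w, delta)]
        for i, gi in enumerate(grad(point, center)):
            contrib[i] += gi * delta[i] / n_steps
    return contrib

center = [0.30, -0.55, 0.80]
w = [0.31, -0.54, 0.79]          # close to the optimum, so gradients are tiny
w_q = quantize(w, step=0.5)      # [0.5, -0.5, 1.0]
true_delta = loss(w_q, center) - loss(w, center)
print("true change     :", true_delta)
print("Taylor estimate :", taylor_estimate(w, w_q, center))
print("path integral   :", sum(path_integral(w, w_q, center)))
```

In this toy setting the Taylor estimate comes out well over an order of magnitude smaller than the true change, because the derivatives at $\boldsymbol{w}$ say little about the loss at the distant point $\tilde{\boldsymbol{w}}$. The summed path integral tracks the true change closely, and its per-element terms give a sensitivity ranking.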
Abstract
Serving Large Language Models (LLMs) is costly. Post-training weight quantization can address this problem by compressing model size to fit limited memory and by saving bandwidth for acceleration. As not all weight dimensions are equally important, these methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on the loss function and is used to preprocess the original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric, and find that existing gradient- and Hessian-based metrics are very inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly due to the small convergence radius of the local second-order approximation, i.e., the gradient and Hessian terms in Taylor's formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric to estimate posterior sensitivity in a fine-grained manner. To leverage this accurate metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weight detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced perplexity improvement of 2.66 on Llama 3.2 1B with QTIP.
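The Dense-and-Sparse idea mentioned above can be sketched generically: a few sensitive weights are detached into a full-precision sparse part, and the remaining dense part is quantized. The sketch below assumes a fixed top-k rule over a supplied sensitivity score and a round-to-nearest grid. Both are stand-ins: ReQuant's self-adaptive outlier selection and step-wise detach are not reproduced here, and the names `dense_and_sparse`, `reconstruct`, `round_to_grid`, and `scores` are hypothetical.

```python
def round_to_grid(x, step):
    # Round-to-nearest uniform grid; a placeholder for a real PTQ method.
    return round(x / step) * step

def dense_and_sparse(weights, scores, k, step):
    # Detach the k highest-scoring weights into a full-precision sparse part
    # and quantize the remaining dense part. The fixed top-k rule is an
    # assumption, not the paper's self-adaptive selection.
    order = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    sparse = {i: weights[i] for i in order[:k]}
    dense = [0.0 if i in sparse else round_to_grid(x, step)
             for i, x in enumerate(weights)]
    return dense, sparse

def reconstruct(dense, sparse):
    # Effective weights: quantized dense part plus the sparse corrections.
    return [d + sparse.get(i, 0.0) for i, d in enumerate(dense)]

weights = [0.31, -0.54, 0.79, 2.90]
scores = [0.1, 0.2, 0.05, 5.0]   # hypothetical per-weight sensitivities
dense, sparse = dense_and_sparse(weights, scores, k=1, step=0.5)
print(dense)                       # quantized dense part, zero where detached
print(sparse)                      # detached weights at full precision
print(reconstruct(dense, sparse))
```

Storing the detached weights as an index-to-value map keeps the dense part on a regular grid, so the sparse part only costs memory in proportion to the number of detached weights.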
