Understanding the Difficulty of Low-Precision Post-Training Quantization for LLMs
Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang
TL;DR
The paper investigates why ultra-low-precision post-training quantization (PTQ) of large language models is difficult, contrasting PTQ methods that minimize local, layer-wise quantization error (GPTQ) with quantization-aware fine-tuning (QAFT), which minimizes the global loss (negative log-likelihood, $NLL$) of the quantized model. Across the GPT-2, OPT, and Llama-2 families, with a 128-example dataset and multiple bit-widths, QAFT consistently delivers lower global $NLL$ and better Pareto efficiency than GPTQ, especially at int4 and below, because the local $MSE$ objective is misaligned with the global $NLL$ objective. Loss-landscape analysis around the pretrained weights reveals an attractive basin whose radius $R({\bm{w}})$ governs whether quantization perturbations remain within the region where local improvements translate into global gains; QAFT tends to move along less steep directions that yield lower $NLL$ even when farther from the origin. The findings challenge reliance on local-error metrics for quantization decisions, provide a landscape-based explanation of when PTQ approaches may be effective, and show that QAFT offers a robust practical route to deploying ultra-low-precision LLMs across model families and datasets such as WikiText-2, C4, and LAMBADA.
Abstract
Large language models with high parameter counts are computationally expensive, yet can be made much more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization by minimizing local, layer-wise quantization errors, or through quantization-aware fine-tuning by minimizing the global loss function. In this study, we discovered that, under the same data constraint, the former approach nearly always fared worse than the latter, a phenomenon particularly prominent when the numerical precision is very low. We further showed that this difficulty of post-training quantization arose from stark misalignment between optimization of the local and global objective functions. Our findings explain the limited utility of minimizing local quantization error, and the importance of direct quantization-aware fine-tuning, in the regime of large models at very low precision.
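To make the "local, layer-wise quantization error" concrete, the following is a minimal sketch for a single linear layer. It is not the paper's pipeline and not GPTQ: the symmetric per-tensor round-to-nearest scheme, the helper names `quantize_rtn` and `local_mse`, and the random inputs and weights are all illustrative assumptions. The global objective, by contrast, would be the $NLL$ of the whole quantized model, which this toy example does not compute.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (assumed scheme, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for int4
    scale = np.abs(w).max() / qmax             # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized weights

def local_mse(x, w, w_q):
    """Layer-wise output error: mean squared difference between x @ w and x @ w_q."""
    return float(np.mean((x @ w - x @ w_q) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 16))   # 128 hypothetical input examples
w = rng.standard_normal((16, 8))     # hypothetical layer weights

# Local error of this one layer at several bit-widths.
errors = {bits: local_mse(x, w, quantize_rtn(w, bits)) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"int{bits}: local MSE = {err:.6f}")
```

The local error grows as the bit-width shrinks, but it says nothing directly about the model's final loss; that gap between the per-layer metric and the global $NLL$ is the misalignment the abstract refers to.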
