Understanding the Difficulty of Low-Precision Post-Training Quantization for LLMs

Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang

TL;DR

The paper investigates why effective ultra-low-precision post-training quantization (PTQ) of large language models is difficult, contrasting PTQ methods that minimize local, layer-wise errors (GPTQ) with quantization-aware fine-tuning (QAFT), which minimizes the global loss $\mathrm{NLL}$ of the quantized model. Across the GPT-2, OPT, and Llama-2 families, under the same 128-example data constraint and at multiple bit-widths, QAFT consistently delivers lower global $\mathrm{NLL}$ and better Pareto efficiency than GPTQ, especially at int4 and below; the gap is attributed to a misalignment between the local $\mathrm{MSE}$ and global $\mathrm{NLL}$ objectives. Loss-landscape analysis around the pretrained weights ${\bm{w}}$ reveals an attractive basin whose radius $R({\bm{w}})$ governs whether quantization perturbations remain within the region where local improvements translate into global gains; QAFT tends to move along less steep directions that yield lower $\mathrm{NLL}$ even when the solution lies farther from the pretrained weights. The findings challenge reliance on local-error metrics for quantization decisions, provide a landscape-based explanation for when PTQ approaches may be effective, and show that QAFT offers a robust practical route for deploying ultra-low-precision LLMs across model families and datasets such as WikiText-2, C4, and LAMBADA.

Abstract

Large language models of high parameter counts are computationally expensive, yet can be made much more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization by minimizing local, layer-wise quantization errors, or through quantization-aware fine-tuning by minimizing the global loss function. In this study, we discovered that, under the same data constraint, the former approach nearly always fared worse than the latter, a phenomenon particularly prominent when the numerical precision is very low. We further showed that this difficulty of post-training quantization arose from stark misalignment between optimization of the local and global objective functions. Our findings explain the limited utility of minimizing local quantization error and the importance of direct quantization-aware fine-tuning, in the regime of large models at very low precision.
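To make the "local, layer-wise quantization error" concrete, the following is a minimal sketch of a round-to-nearest (RTN) weight quantizer and the mean squared error it induces on a single linear layer's outputs. The symmetric per-tensor integer format, the toy layer shapes, and the helper names `rtn_quantize` and `layer_mse` are illustrative assumptions, not the paper's setup. GPTQ-style methods reduce this kind of per-layer error, whereas QAFT optimizes the model's global $\mathrm{NLL}$ directly.

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (illustrative format)."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 7 for 4-bit signed integers
    scale = np.max(np.abs(w)) / qmax          # map the largest magnitude onto qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                          # de-quantized ("fake-quantized") weights

def layer_mse(w, w_q, x):
    """Local objective: MSE between full-precision and quantized layer outputs."""
    return float(np.mean((x @ w.T - x @ w_q.T) ** 2))

# Toy linear layer and a hypothetical 128-example batch of inputs.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))      # (out_features, in_features)
x = rng.normal(size=(128, 16))    # (examples, in_features)

w_q4 = rtn_quantize(w, bits=4)
w_q8 = rtn_quantize(w, bits=8)
mse4 = layer_mse(w, w_q4, x)
mse8 = layer_mse(w, w_q8, x)
print(f"layer-wise MSE  int4: {mse4:.3e}   int8: {mse8:.3e}")
```

A small value of this local error does not by itself guarantee a small global loss, which is the misalignment the study examines.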

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures.

Figures (7)

  • Figure 1: Misalignment between minimization of the global $\mathrm{NLL}$ loss ($\textnormal{QAFT}$) and minimization of the local layer-wise $\mathrm{MSE}$ losses ($\textnormal{GPTQ}$). The upper row shows global NLL losses, and the lower row presents layer-wise MSE losses for three models (one per column). Data points compare QAFT (vertical axis) to GPTQ (horizontal axis). The gray diagonal indicates identity. Black dots (if present) represent full-precision models, while colored dots mark losses after QAFT and GPTQ. Colored lines originating from each dot intersect the diagonal, showing RTN-quantized model losses for the corresponding format. In the lower row, symbols represent individual quantized layers.
  • Figure 2: Tradeoff between quantized model generalization and its weight size. Upper: models from the GPT-2 model family: distilgpt2, gpt2, gpt2-medium, gpt2-large and gpt2-xl. Lower: models from the OPT and Llama 2 families: opt-125m, opt-350m, opt-1.3b, opt-2.7b, opt-6.7b and Llama-2-7b-hf. Black circles represent the full-precision models. Hollow colored circles are $\textnormal{RTN}$-quantized models, solid colored circles are $\textnormal{QAFT}$-quantized models, and solid colored squares are $\textnormal{GPTQ}$-quantized models. Dotted, dashed and solid gray lines connect quantized solutions from the same model produced by $\textnormal{RTN}$, $\textnormal{GPTQ}$ and $\textnormal{QAFT}$, respectively. We highlight the difference between $\textnormal{GPTQ}$- and $\textnormal{QAFT}$-quantized models with colored, transparent, vertical strips, for each quantized model.
  • Figure 3: Loss landscape analysis of quantized model weights. Data illustrated here are from opt-125m, a network small enough for numerous loss evaluations. In the legend at the top, we illustrate the mapping strategy in a $2$-dimensional cartoon, which captures key concepts in the $D$-dimensional weight space. The black dot in the middle marks the pretraining convergence ${\bm{w}}$. The continuous loss landscape is probed first by measuring loss at ${\bm{w}} + \lambda \hat{{\bm{e}}}$, i.e. the pretrained weights perturbed along a random direction $\hat{{\bm{e}}} \sim {\mathcal{S}}^D$ sampled uniformly from the $D$-dimensional unit sphere. We sweep $\lambda \in {\mathbb{R}}^+$ (thin, light gray lines emanating from the black dot) to map the radial loss landscape along a specific random direction $\hat{{\bm{e}}}$. The gray grid represents the representable weight values prescribed by the weight quantizer $Q(\cdot)$, out of which we show the three key quantized weights in question: ${\bm{w}}_\textnormal{RTN}=Q({\bm{w}})$ (blue circle), ${\bm{w}}_\textnormal{QAFT}$ (green circle), and ${\bm{w}}_\textnormal{GPTQ}$ (red circle). We measure the loss function at these key points as well as at points along the linear segment formed by convex combinations of two of them (colored lines). We plot the radial loss landscape ($\mathrm{NLL}$ loss against $\ell_2$ distance from ${\bm{w}}$) in the lower panels, training loss on the left and validation loss on the right. Graphical symbols of points and segments are consistent with the legend at the top.
  • Figure 4: Misalignment between minimization of the global NLL loss (by QAFT) and minimization of the local layer-wise MSE losses (by GPTQ) on different datasets. Follows the same conventions as Figure 1.
  • Figure 5: Loss landscape analysis of quantized model weights on different datasets. Plots show results for opt-125m quantized in int4. Follows the same conventions as Figure 3.
  • ...and 2 more figures
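The probing procedure described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: a toy quadratic bowl stands in for the model's $\mathrm{NLL}$, a uniform grid with step `step` stands in for the quantizer $Q(\cdot)$, and `w_alt` is a hypothetical second grid point standing in for a QAFT or GPTQ solution. None of these names or values come from the paper.

```python
import numpy as np

def sample_unit_direction(dim, rng):
    """Uniform sample from the unit sphere: normalize an isotropic Gaussian."""
    e = rng.normal(size=dim)
    return e / np.linalg.norm(e)

def radial_profile(loss_fn, w, direction, lambdas):
    """Loss at w + lambda * e_hat for each lambda (one radial line)."""
    return np.array([loss_fn(w + lam * direction) for lam in lambdas])

def segment_profile(loss_fn, w_a, w_b, alphas):
    """Loss along convex combinations (1 - a) * w_a + a * w_b."""
    return np.array([loss_fn((1 - a) * w_a + a * w_b) for a in alphas])

# Toy stand-in for the NLL: an anisotropic quadratic bowl centered at w.
rng = np.random.default_rng(0)
dim = 32
w = rng.normal(size=dim)                     # plays the role of pretrained weights
curvature = np.linspace(0.1, 10.0, dim)      # steep and flat directions

def toy_loss(v):
    d = v - w
    return float(0.5 * np.sum(curvature * d * d))

# Radial probe along one random direction.
e_hat = sample_unit_direction(dim, rng)
lambdas = np.linspace(0.0, 2.0, 21)
radial = radial_profile(toy_loss, w, e_hat, lambdas)

# Two points on a hypothetical quantization grid.
step = 0.25
w_rtn = np.round(w / step) * step            # round-to-nearest, Q(w)
w_alt = w_rtn.copy()
w_alt[:4] += step                            # hypothetical alternative grid point

alphas = np.linspace(0.0, 1.0, 11)
segment = segment_profile(toy_loss, w_rtn, w_alt, alphas)
# x-axis of the radial plot: l2 distance of each segment point from w.
distances = np.array(
    [np.linalg.norm((1 - a) * w_rtn + a * w_alt - w) for a in alphas]
)
```

Plotting `radial` against `lambdas` and `segment` against `distances` gives the kind of loss-versus-distance view shown in the lower panels of the figure, here for the toy loss only.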