Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization

Yamato Arai, Yuma Ichikawa

TL;DR

This work identifies a critical bottleneck in layer-wise PTQ for large language models: the exponential accumulation of quantization errors across layers, which degrades performance in low-bit regimes. It introduces Quantization Error Propagation (QEP), a lightweight framework that propagates and compensates these accumulated errors, with a tunable propagation strength $\alpha_l$ to balance overfitting and efficiency. The authors derive a closed-form weight correction and show how to integrate QEP with existing PTQ methods, preserving the Hessian-based acceleration. Empirical results across multiple models and datasets demonstrate substantial improvements in perplexity and zero-shot tasks, especially at 2-bit quantization, suggesting that QEP can enable practical extreme compression while maintaining accuracy. The approach offers a practical, orthogonal enhancement to current PTQ pipelines and points to fruitful future work combining QEP with nonlinear or block-wise quantization techniques.

Abstract

Layer-wise PTQ is a promising technique for compressing large language models (LLMs), due to its simplicity and effectiveness without requiring retraining. However, recent progress in this area is saturating, underscoring the need to revisit its core limitations and explore further improvements. We address this challenge by identifying a key limitation of existing layer-wise PTQ methods: the growth of quantization errors across layers significantly degrades performance, particularly in low-bit regimes. To address this fundamental issue, we propose Quantization Error Propagation (QEP), a general, lightweight, and scalable framework that enhances layer-wise PTQ by explicitly propagating quantization errors and compensating for accumulated errors. QEP also offers a tunable propagation mechanism that prevents overfitting and controls computational overhead, enabling the framework to adapt to various architectures and resource budgets. Extensive experiments on several LLMs demonstrate that QEP-enhanced layer-wise PTQ achieves substantially higher accuracy than existing methods. Notably, the gains are most pronounced in the extremely low-bit quantization regime.

Paper Structure

This paper contains 43 sections, 12 theorems, 100 equations, 3 figures, 11 tables.

Key Result

Proposition 5.1

Assume that the matrix $\widehat{{\bm H}}_l$ is invertible. Then, after relaxing the discrete feasible set ${\mathbb Q}^{n_{l} \times d_{l}}$ into the continuous domain ${\mathbb R}^{n_{l} \times d_{l}}$, the optimal solution ${\bm W}_{l}^{\ast}$ is given by the following closed-form expression: where ${\bm \delta}_l \coloneqq {\bm X}_{l} - \widehat{{\bm X}}_{l}$ represents the accumulated quanti
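The closed-form correction described in Proposition 5.1 can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation. It assumes that the relaxed layer objective is $\min_{\bm W} \| {\bm W}_l {\bm X}_l - {\bm W} \widehat{{\bm X}}_l \|_F^2$, that $\widehat{{\bm H}}_l = \widehat{{\bm X}}_l \widehat{{\bm X}}_l^\top$, and that the propagation strength $\alpha_l$ simply scales the correction term. None of these is stated in the excerpt above, and the function name `qep_correction` is hypothetical.

```python
import numpy as np

def qep_correction(W, X, X_hat, alpha=1.0):
    """Sketch of a continuous-relaxation weight correction.

    ASSUMPTIONS (not confirmed by the excerpt):
      * objective: min_W' || W X - W' X_hat ||_F^2
      * H_hat = X_hat @ X_hat.T, assumed invertible
      * alpha scales the correction term linearly
    Shapes: W is (n, d); X and X_hat are (d, m).
    """
    delta = X - X_hat                  # accumulated upstream error, as in the proposition
    H_hat = X_hat @ X_hat.T            # (d, d), symmetric
    rhs = W @ delta @ X_hat.T          # (n, d)
    # rhs @ inv(H_hat), computed with a linear solve instead of an explicit inverse.
    # H_hat is symmetric, so solving against rhs.T and transposing is equivalent.
    correction = np.linalg.solve(H_hat, rhs.T).T
    return W + alpha * correction

# Toy data: the "quantized-path" activations are a noisy copy of the originals.
rng = np.random.default_rng(0)
n, d, m = 4, 6, 64
W = rng.standard_normal((n, d))
X = rng.standard_normal((d, m))
X_hat = X + 0.1 * rng.standard_normal((d, m))

W_star = qep_correction(W, X, X_hat)
err_before = np.linalg.norm(W @ X - W @ X_hat) ** 2       # uncorrected weights
err_after = np.linalg.norm(W @ X - W_star @ X_hat) ** 2   # corrected weights
```

Under these assumptions, `alpha=0.0` leaves the weights unchanged and `alpha=1.0` gives the least-squares solution of the relaxed problem, so `err_after` cannot exceed `err_before`. In a real pipeline the corrected weights would then be passed to the quantizer, which this sketch omits.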

Figures (3)

  • Figure 1: WikiText-2 perplexity of Llama-2 models (7B to 70B) quantized to INT4, INT3, and INT2 with RTN, GPTQ, AWQ, and QuIP. Solid bars indicate PTQ with QEP; outlined bars indicate PTQ without QEP. Truncated bars indicate perplexities exceeding the axis limits. QEP consistently reduces perplexity, with larger improvements at lower bit widths and smaller model sizes. See the experiments section for detailed settings and results.
  • Figure 2: Accumulation and growth of quantization errors across layers in a partially quantized Llama-2-7B model (Touvron et al., 2023). The first $10$ Transformer blocks are quantized using standard RTN (BASE) and QEP-enhanced RTN (With QEP), while the remaining Transformer blocks after the $10$th remain at full precision. The plot shows the squared Frobenius norm $\Delta_{m}$ between the original and partially quantized outputs at each Transformer block $m$, as defined by the paper's error-growth metric.
  • Figure 3: Results averaged over 5 random seeds comparing QuIP with and without QEP across different quantization levels. Each subplot shows results for INT4, INT3, and INT2 quantization, respectively, with the horizontal axis indicating model size (7B, 13B, 70B). The top row reports perplexity on WikiText-2 (lower is better), while the bottom row shows the average of normalized accuracy scores on ARC (easy), PIQA, and StoryCloze benchmarks (higher is better), representing generalization capability. Error bars represent the standard error of the mean (SEM). Models using QEP-QuIP consistently outperform or match the performance of baseline QuIP, especially under more aggressive quantization (INT3 and INT2).
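The partial-quantization measurement described for Figure 2 can be sketched on a toy network. The example below is only an illustration under stated assumptions: it replaces Transformer blocks with `tanh` linear layers, uses a symmetric per-row round-to-nearest (RTN) grid, and takes $\Delta_m$ to be the squared Frobenius norm between full-precision and partially quantized activations after layer $m$. The helper names `rtn_quantize` and `error_growth` are hypothetical.

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Symmetric per-row round-to-nearest quantization (an assumed grid)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def error_growth(weights, X0, n_quantized, bits=4):
    """Return Delta_m for each layer m of a toy stack of tanh layers.

    Only the first `n_quantized` layers are quantized; later layers stay at
    full precision but still receive the perturbed activations.
    """
    X, X_hat = X0, X0
    deltas = []
    for m, W in enumerate(weights):
        W_hat = rtn_quantize(W, bits) if m < n_quantized else W
        X = np.tanh(W @ X)              # full-precision path
        X_hat = np.tanh(W_hat @ X_hat)  # partially quantized path
        deltas.append(float(np.linalg.norm(X - X_hat) ** 2))
    return deltas

rng = np.random.default_rng(0)
d, m, n_layers = 16, 32, 8
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
X0 = rng.standard_normal((d, m))
deltas = error_growth(weights, X0, n_quantized=4, bits=3)
```

Plotting `deltas` against the layer index gives a curve analogous in spirit to the BASE line of Figure 2. Whether the error grows or decays after the last quantized layer in this toy setting depends on the random weights and says nothing about real Transformer models.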

Theorems & Definitions (21)

  • Proposition 5.1
  • Theorem 5.2: Informal
  • Proposition 5.3
  • Proposition 5.4
  • proof
  • Proposition B.1
  • proof
  • Proposition B.2
  • proof
  • Proposition B.3
  • ...and 11 more