QERA: an Analytical Framework for Quantization Error Reconstruction

Cheng Zhang, Jeffrey T. H. Wong, Can Xiao, George A. Constantinides, Yiren Zhao

TL;DR

QERA introduces an analytical framework for quantization error reconstruction in neural networks, reframing the reconstruction target as minimizing layer output error rather than weight error. It derives exact and approximate closed-form solutions for the rank-$k$ correction ${C}_{k}$, enabling efficient initialization and reconstruction via ${C}_{k}=({R}_{XX}^{1/2})^{-1}{U}_{:,:k}{\Sigma}_{:k,:k}{V}^T_{:k,:}$ (exact) or ${C}_{k}= {S}^{-1}{U}_{:,:k}{\Sigma}_{:k,:k}{V}^T_{:k,:}$ (approximate, with a diagonal ${S}$, under the assumption of uncorrelated embedding dimensions). Empirically, QERA improves both quantized parameter-efficient fine-tuning (QPEFT) and post-training quantization (PTQ) across the RoBERTa and LLaMA families, outperforming LoftQ, ZeroQuant-V2, and LQER in accuracy and perplexity and converging faster, especially at aggressive quantization. The results indicate that analytically grounded layer-output optimization can narrow the performance gap in extremely low-precision quantization, supporting practical deployment of large language models.
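The two closed forms above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a row-vector layer `y = x @ W`, estimates ${R}_{XX}$ empirically from calibration inputs `X`, takes ${U}$, ${\Sigma}$, ${V}^T$ from the SVD of the scaled weight error, and clips tiny eigenvalues so the inverse square root exists. The function names and the `eps` guard are hypothetical.

```python
import numpy as np

def truncated_svd(M, k):
    """Return the top-k singular triplets of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def qera_exact(W, W_q, X, k, eps=1e-12):
    """Exact form: C_k = (R^{1/2})^{-1} U_k S_k V_k^T.

    W, W_q: (m, n) full-precision and quantized weights.
    X: (num_samples, m) calibration inputs, one sample per row.
    """
    R = X.T @ X / X.shape[0]                # empirical autocorrelation, (m, m)
    evals, evecs = np.linalg.eigh(R)        # R is symmetric PSD
    evals = np.clip(evals, eps, None)       # assumption: guard against singular R
    R_half = (evecs * np.sqrt(evals)) @ evecs.T
    R_half_inv = (evecs / np.sqrt(evals)) @ evecs.T
    U, s, Vt = truncated_svd(R_half @ (W - W_q), k)
    return R_half_inv @ (U * s) @ Vt

def qera_approx(W, W_q, X, k, eps=1e-12):
    """Diagonal form: C_k = S^{-1} U_k S_k V_k^T with S = diag(sqrt(E[x_i^2]))."""
    s_diag = np.sqrt(np.mean(X ** 2, axis=0))   # (m,), per-dimension RMS
    s_diag = np.maximum(s_diag, eps)            # assumption: avoid division by zero
    U, s, Vt = truncated_svd(s_diag[:, None] * (W - W_q), k)
    return ((U * s) @ Vt) / s_diag[:, None]

def output_error(W, W_q, C, X):
    """Mean squared layer-output error over the calibration inputs."""
    diff = X @ (W_q + C) - X @ W
    return np.mean(np.sum(diff ** 2, axis=1))
```

On the calibration set itself, and when the empirical ${R}_{XX}$ is full rank, `qera_exact` should give an `output_error` no larger than any other rank-$k$ correction. `qera_approx` avoids the matrix square root and only needs per-dimension statistics.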

Abstract

The growing number of parameters and computational demands of large language models (LLMs) present significant challenges for their efficient deployment. Recently, there has been increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error reconstruction terms. The combination of quantization and low-rank approximation is now popular in both adapter-based, parameter-efficient fine-tuning methods such as LoftQ and low-precision inference techniques including ZeroQuant-V2. Usually, the low-rank terms are calculated via the singular value decomposition (SVD) of the weight quantization error, minimizing the Frobenius and spectral norms of the weight approximation error. Recent methods like LQ-LoRA and LQER introduced hand-crafted heuristics to minimize errors in layer outputs (activations) rather than weights, resulting in improved quantization results. However, these heuristic methods lack an analytical solution to guide the design of quantization error reconstruction terms. In this paper, we revisit this problem and formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem. We show that QERA benefits both existing low-precision fine-tuning and inference methods: QERA achieves a fine-tuned accuracy gain of $\Delta_{\text{acc}} = 6.05\%$ for 2-bit RoBERTa-base on GLUE compared to LoftQ, and obtains $\Delta_{\text{acc}} = 2.97\%$ higher post-training quantization accuracy for 4-bit Llama-3.1-70B on average than ZeroQuant-V2 and $\Delta_{\text{ppl}} = -0.28$ (lower perplexity) on WikiText2 compared to LQER.
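For reference, the conventional weight-error approach described in the abstract can be sketched as below. This is an illustrative NumPy example, not code from LoftQ or ZeroQuant-V2; the rounding quantizer and the function name are hypothetical stand-ins. It shows the Eckart-Young property that the truncated SVD leaves a Frobenius error equal to the root sum of squares of the discarded singular values, and a spectral error equal to the first discarded singular value.

```python
import numpy as np

def weight_svd_correction(W, W_q, k):
    """Rank-k correction from the truncated SVD of the weight error W - W_q."""
    U, s, Vt = np.linalg.svd(W - W_q, full_matrices=False)
    C_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return C_k, s

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
W_q = np.round(W / 0.5) * 0.5   # hypothetical stand-in for a real quantizer
k = 3

C_k, s = weight_svd_correction(W, W_q, k)
residual = W - W_q - C_k
frob_err = np.linalg.norm(residual, "fro")   # Frobenius norm of what remains
spec_err = np.linalg.norm(residual, 2)       # spectral norm of what remains
```

Note that this objective never looks at the layer inputs, which is why a lower weight error need not translate into a lower output error.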

Paper Structure (40 sections, 2 theorems, 38 equations, 24 figures, 17 tables, 2 algorithms)

Key Result

Theorem 1

The solution to the output-error minimization problem is ${\bm{C}}_{k} = ({\bm{R}}_{{\mathbb{X}}{\mathbb{X}}}^{\frac{1}{2}})^{-1}{\bm{U}}_{:,:k}{\bm{\Sigma}}_{:k,:k}{\bm{V}}^T_{:k,:}$, where ${\bm{R}}_{{\mathbb{X}}{\mathbb{X}}}$ is the autocorrelation matrix with respect to the input space ${\mathbb{X}}$, ${\bm{R}}_{{\mathbb{X}}{\mathbb{X}}}^{\frac{1}{2}}$ represents the unique symmetric positive semi-definite matrix square root of ${\bm{R}}_{{\mathbb{X}}{\mathbb{X}}}$, and ${\bm{U}}_{:,:k}$, ${\bm{\Sigma}}_{:k,:k}$, and ${\bm{V}}^T_{:k,:}$ ...
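A brief sketch of why a solution of this form arises may help. It is written under assumptions not stated in this excerpt: a row-vector layer $y = xW$, ${\bm{R}}_{{\mathbb{X}}{\mathbb{X}}} = \mathbb{E}[x^T x]$, $\widetilde{W}$ as the (assumed) symbol for the quantized weight, and an invertible ${\bm{R}}_{{\mathbb{X}}{\mathbb{X}}}^{\frac{1}{2}}$.

```latex
\begin{aligned}
\mathbb{E}_{x}\,\bigl\| x(\widetilde{W} + C_k) - xW \bigr\|_2^2
  &= \mathbb{E}_{x}\bigl[\, x\,\Delta\,\Delta^T x^T \bigr],
     \qquad \Delta = \widetilde{W} + C_k - W \\
  &= \operatorname{tr}\bigl( \Delta^T\, \mathbb{E}_{x}[x^T x]\, \Delta \bigr)
   = \operatorname{tr}\bigl( \Delta^T R_{XX}\, \Delta \bigr) \\
  &= \bigl\| R_{XX}^{1/2}\, \Delta \bigr\|_F^2
   = \bigl\| R_{XX}^{1/2}(W - \widetilde{W}) - R_{XX}^{1/2} C_k \bigr\|_F^2 .
\end{aligned}
```

By the Eckart-Young theorem, the rank-$k$ matrix closest in Frobenius norm to ${\bm{R}}_{{\mathbb{X}}{\mathbb{X}}}^{\frac{1}{2}}(W - \widetilde{W})$ is its truncated SVD ${\bm{U}}_{:,:k}{\bm{\Sigma}}_{:k,:k}{\bm{V}}^T_{:k,:}$. Since multiplying by an invertible matrix preserves rank, setting ${\bm{R}}_{{\mathbb{X}}{\mathbb{X}}}^{\frac{1}{2}} C_k$ equal to that truncation and left-multiplying by $({\bm{R}}_{{\mathbb{X}}{\mathbb{X}}}^{\frac{1}{2}})^{-1}$ gives the stated form.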

Figures (24)

  • Figure 1: The model output error of RoBERTa-base before fine-tuning. We feed 128 samples from RoBERTa's pretraining dataset and profile the output logits error between the adapted and the FP32 model. We sweep the rank $k$ and the iteration number of LoftQ on 4-bit and 3-bit models. In LoftQ, neither more iterations nor a higher rank guarantees lower model output error, though the weight approximation error of every layer decreases. In contrast, QERA-approx consistently has the lowest model output error across all settings, and the error monotonically decreases as the rank increases.
  • Figure 2: Faster convergence of QERA-approx on STSB.
  • Figure 3: QERA resolves the discrepancy between the recovered model performance and the number of calibration samples in LQER.
  • Figure 4: AlpacaEval 2.0 evaluation results. We compare quantized models to the counterpart without quantization-error reconstruction. A higher win rate ($\uparrow$) indicates better instruction-following performance.
  • Figure 5: Normalized $\mathrm{abs}({\bm{R}}_{{\mathbb{X}}{\mathbb{X}}})$ of the layer inputs in LLaMA-3-8B. Dark elements denote values close to zero. There are a few layers with input dimensions strongly correlated with others, such as the third attention layer in (a), but for most layers, our zero-expectation assumption holds.
  • ...and 19 more figures

Theorems & Definitions (6)

  • Theorem 1: QERA-exact solution
  • Remark 1
  • proof
  • Theorem 2: QERA-approx solution
  • Remark 2
  • proof