GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration

Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, Priyadarshini Panda

TL;DR

GPTAQ is a finetuning-free, asymmetric calibration framework for quantizing large transformers that addresses the bias of symmetric calibration in GPTQ. It derives an optimal weight-update rule that accounts for both quantization error and input asymmetry, and implements four GPU-friendly optimizations (arbitrary weight ordering, residual decomposition, Cholesky-based Hessian handling, and lazy-batch updates), improving on GPTQ with minimal code changes. The approach scales to extremely large models (e.g., LLaMA3-405B, EVA-02) on a single GPU and narrows the perplexity and accuracy gaps to full-precision baselines on both vision and language tasks. These results underscore the practical impact of asymmetric calibration and efficient Hessian-aware updates for widespread, low-cost deployment of large transformers.

Abstract

We introduce GPTAQ, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model, resulting in a scheme that we call asymmetric calibration. Such a scheme can effectively reduce the quantization error accumulated in previous layers. We analyze this problem using optimal brain compression to derive a closed-form solution. The new solution explicitly minimizes the quantization error as well as the accumulated asymmetry error. Furthermore, we utilize various techniques to parallelize the solution calculation, including channel parallelization, neuron decomposition, and Cholesky reformulation for matrix fusion. As a result, GPTAQ is easy to implement, requiring only 20 more lines of code than GPTQ while improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B language transformer as well as EVA-02, the top-ranked vision transformer that achieves 90% pretraining ImageNet accuracy. Code is available on GitHub.
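
The difference between the two calibration targets can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the toy shapes, the names `x_fp` (input seen by the full-precision model) and `x_q` (the same input after passing through earlier quantized layers), the reading of symmetric calibration as feeding `x_q` to both layers, and the round-to-nearest quantizer, which is only a stand-in and not the paper's weight-update rule. The sketch shows that the asymmetric residual splits into a quantization term and an accumulated asymmetry term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy layer: 8 output channels, 16 input features, 64 calibration tokens.
W = rng.normal(size=(8, 16))
x_fp = rng.normal(size=(16, 64))                  # input along the full-precision path
x_q = x_fp + 0.05 * rng.normal(size=x_fp.shape)   # assumed drift from earlier quantized layers


def round_to_nearest(w, bits=4):
    """Per-row symmetric uniform quantizer (illustrative stand-in only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


W_hat = round_to_nearest(W)

# Symmetric target (assumed reading): both layers are fed the same input.
sym_err = np.linalg.norm(W_hat @ x_q - W @ x_q) ** 2

# Asymmetric target: match the exact full-precision output.
asym_err = np.linalg.norm(W_hat @ x_q - W @ x_fp) ** 2

# The asymmetric residual is the sum of a quantization term and an asymmetry term.
quant_term = (W_hat - W) @ x_q
asym_term = W @ (x_q - x_fp)
assert np.allclose(W_hat @ x_q - W @ x_fp, quant_term + asym_term)

print(f"symmetric error:  {sym_err:.4f}")
print(f"asymmetric error: {asym_err:.4f}")
```

The symmetric error only measures the first term, so any input drift inherited from earlier layers is invisible to it; the asymmetric error exposes both terms to the optimizer.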

Paper Structure

This paper contains 24 sections, 2 theorems, 39 equations, 4 figures, 9 tables, 2 algorithms.

Key Result

Lemma 4.1

Given the Cholesky factor ${\mathbf{L}}$ for the full inverse Hessian matrix ${\mathbf{H}}^{-1}$, the inverse Hessian ${\mathbf{H}}^{-1}_{-q:} = ({\mathbf{X}}_{-q:}{\mathbf{X}}^\top_{-q:})^{-1}$ is equivalent to ${\mathbf{L}}_{q+1:,q+1:}{\mathbf{L}}_{q+1:,q+1:}^\top$.
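
A quick numerical check of the lemma can be sketched in NumPy. The sketch assumes that ${\mathbf{L}}$ is the lower-triangular factor with ${\mathbf{H}}^{-1} = {\mathbf{L}}{\mathbf{L}}^\top$, that ${\mathbf{H}} = {\mathbf{X}}{\mathbf{X}}^\top$, and that ${\mathbf{X}}_{-q:}$ denotes ${\mathbf{X}}$ with its first $q$ rows removed; the sizes `n`, `k`, and `q` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, q = 6, 32, 2   # hypothetical sizes: n input features, k tokens, q rows removed

X = rng.normal(size=(n, k))
H_inv = np.linalg.inv(X @ X.T)

# np.linalg.cholesky returns the lower-triangular factor: H_inv = L @ L.T
L = np.linalg.cholesky(H_inv)

# Direct route: drop the first q rows of X and invert the smaller Hessian.
X_rest = X[q:, :]
direct = np.linalg.inv(X_rest @ X_rest.T)

# Lemma route: reuse the trailing block of the full Cholesky factor
# (1-indexed L_{q+1:, q+1:} is L[q:, q:] with 0-indexed slicing).
via_cholesky = L[q:, q:] @ L[q:, q:].T

max_abs_diff = np.abs(direct - via_cholesky).max()
print(f"max abs difference: {max_abs_diff:.2e}")
```

Under these assumptions the two routes agree up to floating-point error, so a single factorization of the full inverse Hessian can serve every iteration $q$ without re-inverting a smaller matrix each time.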

Figures (4)

  • Figure 1: Calibration pipeline in the symmetric way (GPTQ) and the asymmetric way (GPTAQ).
  • Figure 2: Visualization of input activation MAE loss ($|\widetilde{{\mathbf{X}}}-{\mathbf{X}}|$) when calibrating LLaMA3-8B using GPTQ and GPTAQ.
  • Figure 3: Computing paradigm of the second term for residual output error in the $q=2$ iteration. The inverse Hessian matrix is factorized by Cholesky decomposition; furthermore, $\Delta{\mathbf{X}}_{q,:}{\mathbf{X}}^\top{\mathbf{H}}^{-1}_{-q:}$ is fused into the $q$-th row of matrix ${\mathbf{P}}$, which can be computed in parallel. Dimensions are shown in the bottom-left corner of each matrix.
  • Figure 4: Latency visualization of our algorithm under various $n$.

Theorems & Definitions (4)

  • Lemma 4.1
  • Theorem 4.2
  • proof
  • proof