
OAC: Output-adaptive Calibration for Accurate Post-training Quantization

Ali Edalati, Alireza Ghaffari, Mahsa Ghazvini Nejad, Lu Hou, Boxing Chen, Masoud Asgharian, Vahid Partovi Nia

TL;DR

The paper tackles the accuracy drop of post-training quantization (PTQ) for large language models at extremely low precision. It introduces Output-adaptive Calibration (OAC), which measures quantization error as the distortion of the model's output cross-entropy loss, rather than the traditional layer-wise $\ell_2$ loss. To keep computation feasible for billions of parameters, OAC approximates the Hessian using the Fisher information identity, assumes cross-layer and cross-row independence, and aggregates the row-wise Hessians. Empirically, OAC outperforms state-of-the-art PTQ baselines (e.g., SpQR, BiLLM) on 2-bit and binary quantization across multiple LLM families and tasks, with manageable computational costs, especially when gradient computations use FP16. This work demonstrates a practical, output-aware path to more accurate PTQ for large models in highly constrained precision regimes.

Abstract

Deployment of Large Language Models (LLMs) incurs major computational costs due to their rapidly expanding size. Compression of LLMs reduces the memory footprint, latency, and energy required for their inference. Post-training Quantization (PTQ) techniques have been developed to compress LLMs while avoiding expensive re-training. Most PTQ approaches formulate the quantization error based on a layer-wise Euclidean loss, ignoring the model output. Then, each layer is calibrated using its layer-wise Hessian to update the weights towards minimizing the quantization error. The Hessian is also used for detecting the weights that are most salient to quantization. Such PTQ approaches are prone to accuracy drop in low-precision quantization. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process. We formulate the quantization error based on the distortion of the output cross-entropy loss. OAC approximates the output-adaptive Hessian for each layer under reasonable assumptions to reduce the computational complexity. The output-adaptive Hessians are used to update the weight matrices and detect the salient weights towards maintaining the model output. Our proposed method outperforms state-of-the-art baselines such as SpQR and BiLLM, especially at extreme low-precision (2-bit and binary) quantization.
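
The contrast between the two error formulations described in the abstract can be written out as a rough sketch. The notation below is an assumption for illustration and may differ from the paper's own equations: $\mathbf{X}$ is a layer's calibration input, $\mathbf{w}$ the flattened weights, $\delta\mathbf{w}$ the quantization perturbation, and $\mathbf{g}_i$ the gradient of the cross-entropy loss on calibration sample $i$. The second-order expansion drops the first-order term on the assumption that the pretrained model sits near a local minimum of the loss.

```latex
% Layer-wise (output-agnostic) objective for one linear layer with input X.
% The Hessian with respect to any single row of W is the same matrix:
\min_{\widehat{\mathbf{W}}} \; \lVert \mathbf{W}\mathbf{X} - \widehat{\mathbf{W}}\mathbf{X} \rVert_F^2,
\qquad \mathbf{H}_{\ell_2} = 2\,\mathbf{X}\mathbf{X}^\top

% Output-adaptive objective: second-order expansion of the cross-entropy loss
\mathcal{L}_\mathrm{CE}(\widehat{\mathbf{w}}) - \mathcal{L}_\mathrm{CE}(\mathbf{w})
\approx \tfrac{1}{2}\, \delta\mathbf{w}^\top \mathbf{H}\, \delta\mathbf{w},
\qquad \delta\mathbf{w} = \widehat{\mathbf{w}} - \mathbf{w}

% Fisher-style approximation of H from N calibration samples
\mathbf{H} \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{g}_i \mathbf{g}_i^\top,
\qquad \mathbf{g}_i = \nabla_{\mathbf{w}} \mathcal{L}_\mathrm{CE}^{(i)}
```

The layer-wise Hessian depends only on the layer's inputs, whereas the Fisher-style approximation depends on gradients of the final loss, which is how the model output enters the calibration.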

Paper Structure (40 sections, 19 equations, 4 figures, 14 tables, 1 algorithm)


Figures (4)

  • Figure 1: a) Output-agnostic calibration minimizes the $\ell_2$ loss between the output of the quantized and original layers. b) Output-adaptive calibration matches the final output of the original and quantized models.
  • Figure 2: The proposed steps to reduce the computational complexity of the output-adaptive Hessian. 1) The Hessian of each linear layer is computed independently. 2) The Hessian of each linear layer becomes block diagonal under the row-independence assumption. 3) All of the row-wise Hessians are aggregated to reduce the memory footprint.
  • Figure 3: A demonstration of the OAC steps for 2-bit PTQ of LLMs. 1) The transformer blocks are iteratively selected for calibration. 2) The final outputs for the calibration samples are generated. 3) The generated outputs are compared with the ground truth outputs to compute the loss and gradients. 4) The output-adaptive Hessians of linear layers inside the block are approximated. 5) For each linear layer, the outliers are detected and isolated using $\widehat{\mathbf{H}}_\mathrm{OAC}$. 6) Column-wise calibration is performed to reduce the quantization error. 7) The quantization scales and zeros go through a second round of quantization to reduce the average bit width. Steps 5, 6, and 7 are integrated into our method from SpQR (Dettmers et al., 2024).
  • Figure 4: A demonstration of the row-wise Hessian aggregation equation (label eq:fullRAH) from the aggregation section (label sec:aggregate), where a) shows the Hessian of each row and b) shows the aggregation of row-wise Hessians.
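
The row-wise Hessian steps in Figures 2 and 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes each row's Hessian is approximated by averaged outer products of per-sample gradients (a Fisher-style approximation) and that aggregation is a plain sum over rows; the function names and the `(N, d_row, d_col)` gradient layout are hypothetical.

```python
import numpy as np

def rowwise_fisher_hessians(grads: np.ndarray) -> np.ndarray:
    """Per-row Hessian approximations for one linear layer.

    grads has shape (N, d_row, d_col): the gradient of the output loss with
    respect to the layer's weight matrix for each of N calibration samples
    (assumed layout). Returns shape (d_row, d_col, d_col), where entry r is
    the average over samples of g_r g_r^T for weight row r.
    """
    n = grads.shape[0]
    return np.einsum("nrc,nrd->rcd", grads, grads) / n

def aggregated_hessian(grads: np.ndarray) -> np.ndarray:
    """Sum of all row-wise Hessians, computed without materializing them.

    Equivalent to averaging G_i^T G_i over samples, so only one
    (d_col, d_col) matrix is stored per layer.
    """
    n = grads.shape[0]
    return np.einsum("nrc,nrd->cd", grads, grads) / n

# Toy data standing in for real per-sample gradients.
rng = np.random.default_rng(0)
grads = rng.standard_normal((8, 3, 4))   # N=8 samples, 3 rows, 4 columns

per_row = rowwise_fisher_hessians(grads)  # shape (3, 4, 4)
agg = aggregated_hessian(grads)           # shape (4, 4)
```

Storing only the aggregate needs `d_col * d_col` values per layer instead of `d_row * d_col * d_col`, which is the memory saving that the aggregation step in Figure 2 refers to.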