CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding

Ziteng Sun, Adrian Benton, Samuel Kushnir, Asher Trockman, Vikas Singh, Suhas Diggavi, Ananda Theertha Suresh

TL;DR

This work tackles calibration-free post-training quantization (PTQ) for large language models by learning weight transformations and adaptive rounding that minimize quantization loss without calibration data. It introduces a proxy Frobenius-norm objective and two classes of learnable transforms: block-diagonal matrices for single weight matrices and arbitrary transformations for coupled matrix pairs, plus a joint/iterative quantization scheme to reduce matrix-product errors. Empirical results on Gemma 2 models show consistent improvements over calibration-free baselines for both 4-bit and 3-bit quantization, with less than 3% extra computation and performance competitive with GPTQ. The approach offers a privacy-preserving, data-independent PTQ pathway for deploying quantized LLMs in regulation-sensitive settings, and points to future work on combining it with calibration-based techniques.

Abstract

Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is to use a round-to-nearest quantization level scheme. However, this often introduces large errors due to outliers in the weights. Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations or committing to a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting in some real-world scenarios as such data may be unavailable or subject to privacy regulations. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to any calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss without calibration data. To maintain inference efficiency, we perform structured matrix transformations for single matrices. For paired weights that interact directly in the computation graph, we use dual matrix transformations and adaptive rounding methods. We conduct experiments on Gemma 2 models, and observe consistent improvement over the baselines. For Gemma 2 9B quantization, our method improves the average benchmark score from 61.9 to 62.4 for 4-bit quantization and from 52.0 to 60.6 for 3-bit quantization, while adding less than 3% of computation overhead. Furthermore, our method achieves performance comparable to the commonly used GPTQ method, which requires calibration data.
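
To make the outlier problem in the abstract concrete, the following is a minimal sketch, not the paper's method. It assumes a symmetric round-to-nearest quantizer with a single absmax scale per vector, and it uses a fixed randomized Hadamard rotation as a stand-in for the "random rotation transformations" listed among existing mitigations; CafeQ instead learns its transformations. The names `rtn_quantize`, `hadamard`, and `sq_error` are hypothetical helpers introduced for this illustration.

```python
import math
import random


def rtn_quantize(values, bits):
    """Symmetric round-to-nearest with one absmax scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def hadamard(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2).

    The normalized transform is symmetric and orthogonal, so it is its own inverse.
    """
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def sq_error(a, b):
    """Squared Frobenius (here: Euclidean) norm of the difference."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
n = 256
w = [rng.gauss(0.0, 1.0) for _ in range(n)]
w[7] = 50.0  # a single outlier inflates the absmax scale

# Baseline: quantize the weights directly.
err_plain = sq_error(w, rtn_quantize(w, 4))

# Rotate (random sign flips, then Hadamard), quantize, and rotate back.
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
w_rot = hadamard([s * v for s, v in zip(signs, w)])
q_rot = rtn_quantize(w_rot, 4)
w_back = [s * v for s, v in zip(signs, hadamard(q_rot))]
err_rot = sq_error(w, w_back)

print("4-bit squared error without rotation:", err_plain)
print("4-bit squared error with rotation:   ", err_rot)
```

The rotation spreads the outlier's magnitude across all coordinates, which shrinks the quantization step and hence the error. Because the rotation is orthogonal, the error measured after rotating back equals the error in the rotated space.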

Paper Structure

This paper contains 23 sections, 19 equations, 2 figures, 8 tables, 1 algorithm.

Figures (2)

  • Figure 1: Distribution over the VO product relative PQE as a function of the temperature on the LogSumExp pseudo-loss, optimized with Adam. Each violin encompasses a sweep over learning rate in $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ and orthogonal regularization weight in $\{0, 0.1, 1\}$ for the given value of $t$. Note that we exclude runs with a learning rate of 1.0, as these runs diverged (see Figure 2).
  • Figure 2: Learning curves for Cayley SGD (top) and Adam with various orthonormal regularization weights (center and bottom), with each pseudo-loss a separate line. Learning rate is varied along columns. PQE is on the y-axis and iteration count on the x-axis. Each line corresponds to the mean PQE across all model layers for a given pseudo-loss, with the 95% bootstrap confidence interval indicated by the shaded region.
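
The captions above do not define the LogSumExp pseudo-loss or how the temperature $t$ enters it. As a generic illustration only, a temperature-scaled LogSumExp is often used as a smooth stand-in for a maximum. The function `smooth_max` and the convention of dividing by `t` inside the exponent are assumptions of this sketch, not details taken from the paper.

```python
import math


def smooth_max(values, t):
    """t * log(sum(exp(v / t))), computed stably by subtracting the maximum.

    For n values this lies between max(values) and max(values) + t * log(n),
    so it approaches the true maximum as t goes to 0.
    """
    m = max(values)
    return m + t * math.log(sum(math.exp((v - m) / t) for v in values))


magnitudes = [0.1, 0.5, 3.0]  # e.g. absolute weight values (illustrative)
for t in (1.0, 0.1, 0.01):
    print(t, smooth_max(magnitudes, t))
```

Under this convention a small temperature tracks the largest entry closely, while a larger one spreads the gradient over more entries.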