BayesQ: Uncertainty-Guided Bayesian Quantization

Ismail Lamaakal, Chaymae Yahyati, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi

TL;DR

BayesQ reframes post-training quantization as uncertainty-aware risk minimization by placing a lightweight Gaussian posterior over weights after standard training and optimizing quantization under the posterior-expected loss. It whitens in the posterior space, designs per-block, mixed-precision quantizers (uniform or non-uniform) using closed-form or Monte Carlo (MC) loss estimates, and allocates bits via a greedy knapsack with hardware-aware constraints. An optional calibration-only distillation aligns the quantized model with the posterior predictive teacher, improving calibration at tight budgets. Empirically, BayesQ achieves state-of-the-art accuracy at fixed memory on ResNet-50 and BERT-base across 3.0–4.0 bit budgets, with the largest gains in the tightest regime, and requires one-time preprocessing comparable to GPTQ.

Abstract

We present BayesQ, an uncertainty-guided post-training quantization framework that is the first to optimize quantization under the posterior-expected loss. BayesQ fits a lightweight Gaussian posterior over weights (diagonal Laplace by default; optional K-FAC/low-rank), whitens by the posterior covariance, designs codebooks to minimize posterior-expected distortion, and allocates mixed precision via a greedy knapsack that maximizes marginal expected-loss reduction per bit under a global budget. For scalar quantizers, posterior-expected MSE yields closed-form tables; task-aware proxies are handled by short Monte Carlo on a small calibration set. An optional calibration-only distillation aligns the quantized model with the posterior predictive teacher. At matched average bits/weight of 3.0/3.5/4.0, BayesQ improves over strong PTQ baselines on ResNet-50 (ImageNet) and BERT-base (GLUE); for example, it improves over GPTQ by $+1.5/+0.7/+0.3$ top-1 percentage points on RN50 and by $+1.1/+0.4/+0.2$ GLUE points on BERT, while requiring one-time preprocessing comparable to a GPTQ pass. BayesQ reframes low-bit quantization as uncertainty-aware risk minimization in a practical, post-training pipeline.
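To make the "closed-form tables" for scalar quantizers concrete, here is a minimal sketch under a simplifying assumption that is ours, not the abstract's: a whitened coordinate is treated as a standard normal variable, and the quantizer is a uniform mid-rise grid on a clipping range whose outer cells extend to infinity. The names `cell_distortion`, `uniform_expected_mse`, `build_table`, and `clip_grid` are hypothetical, and the paper's exact expressions may differ.

```python
import math

def _pdf(z):
    # Standard normal density; evaluates to 0.0 at +/- infinity.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _z_pdf(z):
    # z * pdf(z), with the limit 0 taken at infinite endpoints.
    return 0.0 if math.isinf(z) else z * _pdf(z)

def cell_distortion(a, b, c):
    """Closed form of the integral of (z - c)^2 * pdf(z) over (a, b)."""
    mass = _cdf(b) - _cdf(a)
    return (mass * (1.0 + c * c)
            - (_z_pdf(b) - _z_pdf(a))
            + 2.0 * c * (_pdf(b) - _pdf(a)))

def uniform_expected_mse(bits, clip):
    """Expected squared error of a uniform quantizer on [-clip, clip]
    for z ~ N(0, 1) (assumed whitened coordinate)."""
    levels = 2 ** bits
    step = 2.0 * clip / levels
    total = 0.0
    for k in range(levels):
        lo = -math.inf if k == 0 else -clip + k * step
        hi = math.inf if k == levels - 1 else -clip + (k + 1) * step
        center = -clip + (k + 0.5) * step  # reconstruction value of cell k
        total += cell_distortion(lo, hi, center)
    return total

def build_table(bit_choices, clip_grid):
    """Hypothetical per-bit-width table: best expected MSE over a range grid."""
    return {b: min(uniform_expected_mse(b, r) for r in clip_grid)
            for b in bit_choices}

table = build_table([2, 3, 4], [1.5, 2.0, 2.5, 3.0, 3.5])
```

Because each cell's contribution is available in closed form, a table like this costs only a handful of `erf` and `exp` evaluations per candidate bit-width, with no sampling. The grid search over `clip` is only a stand-in for the range optimization the abstract mentions.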

Paper Structure

This paper contains 180 sections, 117 equations, 2 figures, 11 tables, 2 algorithms.

Figures (2)

  • Figure 1: End-to-end BayesQ pipeline: starting from a pretrained network and a small unlabeled calibration set, we fit a lightweight Gaussian posterior over weights (diagonal Laplace by default, with optional K-FAC/low-rank) and derive a whitener to work in an isotropic space; for each block and candidate bit-width, we design quantizers (uniform with optimized range or posterior-weighted non-uniform codebooks) and build per-block expected-loss tables using closed-form MSE or short Monte Carlo proxies; a greedy knapsack then allocates bits under a global storage budget by selecting upgrades with the largest expected-loss reduction per extra bit while respecting hardware packing; an optional calibration-only distillation aligns the quantized model to the posterior predictive teacher; finally, we export mixed-precision integer weights with per-block metadata for deployment on standard INT kernels (FP16 activations by default), with one-time preprocessing cost comparable to GPTQ.
  • Figure 2: Accuracy–bit frontiers with shaded $\pm1$ std across three seeds. Left: RN50 (ImageNet). Right: BERT-base (GLUE avg). BayesQ consistently dominates GPTQ at matched budgets, with the largest margin at 3.0 bits.
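The greedy knapsack step described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the per-block expected-loss tables are already available as plain lists, starts every block at the smallest candidate bit-width, and omits the hardware packing constraints. The function name `greedy_bit_allocation` and all of its parameters are hypothetical.

```python
import heapq

def greedy_bit_allocation(loss_tables, block_sizes, bit_choices, avg_bits_budget):
    """loss_tables[i][j] is the expected loss of block i at bit_choices[j].
    bit_choices must be sorted in increasing order."""
    total_weights = sum(block_sizes)
    budget = avg_bits_budget * total_weights   # global storage budget in bits
    used = bit_choices[0] * total_weights      # every block starts at the minimum
    if used > budget:
        raise ValueError("budget is below the smallest candidate bit-width")
    level = [0] * len(block_sizes)

    def next_upgrade(i):
        # Heap entry for moving block i to its next bit-width, or None.
        j = level[i]
        if j + 1 >= len(bit_choices):
            return None
        extra = (bit_choices[j + 1] - bit_choices[j]) * block_sizes[i]
        reduction = loss_tables[i][j] - loss_tables[i][j + 1]
        return (-reduction / extra, i, extra)  # negated for a max-heap

    heap = [u for u in map(next_upgrade, range(len(block_sizes))) if u is not None]
    heapq.heapify(heap)
    while heap:
        neg_ratio, i, extra = heapq.heappop(heap)
        if neg_ratio >= 0:
            break      # the best remaining upgrade no longer reduces the loss
        if used + extra > budget:
            continue   # does not fit; the remaining budget only shrinks
        used += extra
        level[i] += 1
        upgrade = next_upgrade(i)
        if upgrade is not None:
            heapq.heappush(heap, upgrade)
    return [bit_choices[j] for j in level]

# Toy example with made-up loss values: block 0 benefits far more from extra bits.
bits = greedy_bit_allocation(
    loss_tables=[[1.0, 0.2, 0.1], [1.0, 0.95, 0.93]],
    block_sizes=[100, 100],
    bit_choices=[2, 3, 4],
    avg_bits_budget=3.0,
)
print(bits)  # [4, 2]
```

In the toy example, the 3.0 average-bit budget is spent entirely on the sensitive block, which illustrates why mixed precision can beat a flat 3-bit assignment at the same storage cost.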