CAT: Post-Training Quantization Error Reduction via Cluster-based Affine Transformation

Ali Zoljodi; Radu Timofte; Masoud Daneshtalab

TL;DR

The paper tackles accuracy degradation in post-training quantization (PTQ) at ultra-low bit-widths by introducing Cluster-based Affine Transformation (CAT), which applies cluster-specific affine corrections in logit space after a KL-divergence-based refinement of the quantization parameters. CAT uses PCA to reduce the dimensionality of the logits, K-means to group them into clusters, and per-cluster affine terms $(\boldsymbol{\gamma}_k, \boldsymbol{\beta}_k)$, estimated from cluster statistics, to align low-bit and full-precision outputs. At inference, the corrected logits are $\tilde{z} = (1-\alpha)\, z_{LQ} + \alpha \big(\boldsymbol{\gamma}_k z_{LQ} + \boldsymbol{\beta}_k\big)$, where $\alpha$ blends the original and corrected logits. The approach yields consistent, state-of-the-art improvements on ImageNet-1K across ResNet, MobileNetV2, RegNet, MNasX2, and ViT/DeiT models, particularly in 2-bit activation scenarios, while adding only negligible parameter overhead and functioning as a plug-in for existing PTQ pipelines. Extensive ablations show that CAT's benefits are most pronounced when the FP–LQ gap is large, with the best $\alpha$ around 0.3–0.4 and small cluster counts for highly distorted logits, which gives practical guidance for deployment.
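The pipeline described above (PCA, K-means, per-cluster affine terms, blended output) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`fit_cat`, `apply_cat`, and the helpers) are hypothetical, the SVD-based PCA and plain Lloyd's K-means are simple stand-ins, and the way $\boldsymbol{\gamma}_k$ and $\boldsymbol{\beta}_k$ are derived from "cluster statistics" is assumed here to be per-dimension moment matching (matching the mean and standard deviation of the FP logits within each cluster). The KL-divergence-based refinement of the quantization parameters is not shown.

```python
import numpy as np

def fit_pca(x, n_components):
    """Fit PCA via SVD; return the data mean and the top principal axes."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def assign(x, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per row."""
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns the (k, d) centroid matrix."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def fit_cat(z_lq, z_fp, n_components=2, k=3, eps=1e-8):
    """Estimate per-cluster (gamma_k, beta_k) from calibration logits.

    ASSUMPTION: moment matching per cluster and per logit dimension;
    the paper's exact estimator is not given in the text above.
    """
    mean, comps = fit_pca(z_lq, n_components)
    reduced = (z_lq - mean) @ comps.T
    centroids = kmeans(reduced, k)
    labels = assign(reduced, centroids)
    n_classes = z_lq.shape[1]
    gamma = np.ones((k, n_classes))   # identity transform by default
    beta = np.zeros((k, n_classes))
    for j in range(k):
        mask = labels == j
        if mask.sum() < 2:            # too few samples for statistics
            continue
        mu_lq, mu_fp = z_lq[mask].mean(axis=0), z_fp[mask].mean(axis=0)
        sd_lq, sd_fp = z_lq[mask].std(axis=0), z_fp[mask].std(axis=0)
        gamma[j] = sd_fp / (sd_lq + eps)
        beta[j] = mu_fp - gamma[j] * mu_lq
    return {"mean": mean, "comps": comps, "centroids": centroids,
            "gamma": gamma, "beta": beta}

def apply_cat(z_lq, params, alpha=0.3):
    """Blend: z_tilde = (1 - alpha) * z_lq + alpha * (gamma_k * z_lq + beta_k)."""
    reduced = (z_lq - params["mean"]) @ params["comps"].T
    labels = assign(reduced, params["centroids"])
    corrected = params["gamma"][labels] * z_lq + params["beta"][labels]
    return (1.0 - alpha) * z_lq + alpha * corrected
```

In this sketch, `fit_cat` would be run once on a calibration set of paired LQ and FP logits, and `apply_cat` on the LQ logits of each new batch. Setting `alpha=0` returns the LQ logits unchanged, which makes the blend easy to sanity-check.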

Abstract

Post-Training Quantization (PTQ) reduces the memory footprint and computational overhead of deep neural networks by converting full-precision (FP) values into quantized and compressed data types. While PTQ is more cost-efficient than Quantization-Aware Training (QAT), it is highly susceptible to accuracy degradation under a low-bit quantization (LQ) regime (e.g., 2-bit). Affine transformation is a classical technique used to reduce the discrepancy between the information processed by a quantized model and that processed by its full-precision counterpart; however, we find that using plain affine transformation, which applies a uniform affine parameter set for all outputs, worsens the results in low-bit PTQ. To address this, we propose Cluster-based Affine Transformation (CAT), an error-reduction framework that employs cluster-specific parameters to align LQ outputs with FP counterparts. CAT refines LQ outputs with only a negligible number of additional parameters, without requiring fine-tuning of the model or quantization parameters. We further introduce a novel PTQ framework integrated with CAT. Experiments on ImageNet-1K show that this framework consistently outperforms prior PTQ methods across diverse architectures and LQ settings, achieving up to 53.18% Top-1 accuracy on W2A2 ResNet-18. Moreover, CAT enhances existing PTQ baselines by more than 3% when used as a plug-in. We plan to release our implementation alongside the publication of this paper.
