GranQ: Efficient Channel-wise Quantization via Vectorized Pre-Scaling for Zero-Shot QAT

Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Kijung Lee, Sanghyun Park

TL;DR

GranQ tackles zero-shot quantization by moving beyond layer-wise activation quantization to a granular, per-channel scheme that uses vectorized pre-scaling to enable efficient parallel accumulation. By reshaping activations (activation decomposition) and computing per-channel scaling factors in vectorized form, GranQ preserves activation information under low-bit QAT while eliminating runtime scaling overhead. Across CIFAR-10/100 and ImageNet, GranQ delivers state-of-the-art accuracy gains (e.g., +5.45% on CIFAR-100 at 3-bit) and even surpasses full-precision (FP) accuracy on CIFAR-10 at 5-bit, while keeping latency close to that of layer-wise quantization. The approach offers a practical, hardware-friendly answer to the activation distortion problem in data-free quantization, with potential for broader deployment in constrained-data regimes.

Abstract

Zero-shot quantization (ZSQ) enables neural network compression without original training data, making it a promising solution for restricted data access scenarios. To compensate for the lack of data, recent ZSQ methods typically rely on synthetic inputs generated from the full-precision model. However, these synthetic inputs often lead to activation distortion, especially under low-bit settings. To mitigate this, existing methods typically employ per-channel scaling, but they still struggle due to the severe computational overhead during the accumulation process. To overcome this critical bottleneck, we propose GranQ, a novel activation quantization framework that introduces an efficient pre-scaling strategy. Unlike conventional channel-wise methods that repeatedly perform scaling operations during accumulation, GranQ applies scaling factors in a pre-scaling step through fully vectorized computation, eliminating runtime scaling overhead. This design enables GranQ to maintain fine-grained quantization accuracy while significantly reducing computational burden, particularly in low-bit quantization settings. Extensive experiments under quantization-aware training (QAT) settings demonstrate that GranQ consistently outperforms state-of-the-art ZSQ methods across CIFAR and ImageNet. In particular, our method achieves up to 5.45% higher accuracy in the 3-bit setting on CIFAR-100 and even surpasses the full-precision baseline on CIFAR-10.
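
To make the contrast between layer-wise and channel-wise activation quantization concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a uniform asymmetric min-max quantizer, per-channel statistics reduced over the batch and spatial axes, and a synthetic activation tensor whose channels have very different ranges; the names `fake_quantize`, `channel_spread`, and `mse` are hypothetical.

```python
import numpy as np


def fake_quantize(x, x_min, x_max, bits):
    """Uniform asymmetric quantize-dequantize.

    x_min and x_max may be scalars (layer-wise) or arrays that
    broadcast against x (channel-wise).
    """
    levels = 2 ** bits - 1
    scale = np.maximum(x_max - x_min, 1e-8) / levels
    zero_point = np.round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, levels)
    return (q - zero_point) * scale


def mse(a, b):
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)

# Assumed toy activation map of shape (N, C, H, W); each channel has a
# different dynamic range, which is the case layer-wise scaling handles badly.
channel_spread = np.array([0.05, 0.5, 5.0, 50.0]).reshape(1, 4, 1, 1)
acts = rng.standard_normal((8, 4, 16, 16)) * channel_spread

# Layer-wise: a single (min, max) pair for the whole tensor.
layer_out = fake_quantize(acts, acts.min(), acts.max(), bits=3)

# Channel-wise: one (min, max) pair per channel, broadcast over N, H, W.
ch_min = acts.min(axis=(0, 2, 3), keepdims=True)
ch_max = acts.max(axis=(0, 2, 3), keepdims=True)
channel_out = fake_quantize(acts, ch_min, ch_max, bits=3)

# Error on the narrowest channel, where the shared layer-wise step size
# is far too coarse.
layer_err_small = mse(acts[:, 0], layer_out[:, 0])
channel_err_small = mse(acts[:, 0], channel_out[:, 0])
print(f"layer-wise MSE (channel 0):   {layer_err_small:.6f}")
print(f"channel-wise MSE (channel 0): {channel_err_small:.6f}")
```

In this toy setting the narrow channel collapses to a single level under the shared layer-wise step, while the per-channel range keeps its 3-bit resolution. The sketch only illustrates the granularity argument; it does not model the accumulation cost that GranQ's pre-scaling is designed to remove.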

Paper Structure

This paper contains 22 sections, 6 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison between (a) layer-wise quantization and (b) GranQ on CIFAR-10. Each subfigure visualizes the 32-bit FP (left) and 3-bit quantized (right) activations of the first ResNet-20 layer. GranQ better preserves the original activation with minimal distortion.
  • Figure 2: Overview of the GranQ algorithm. (1) Each activation map $A_l$ is decomposed into channel-wise vectors, which are used to compute scaling factors ($\vec{s}_c$) and zero-points ($\vec{z}_c$) in a vectorized form. (2) The calculated scaling factor is applied in advance (pre-scaled) to the quantized activations through parallel lanes, enabling efficient parallel accumulation.
  • Figure 3: Latency of ResNet-20 quantization across batch sizes on CIFAR-100 with 3-bit setting.
  • Figure 4: Relative quantization error across layers in ResNet-20 with 3-bit quantization on CIFAR-100.
  • Figure 5: Pixel distribution comparison between (a) original and (b) synthetic inputs on ImageNet. Synthetic inputs generated by AdaDFQ [22] exhibit a rightward shift with higher skewness.
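
The two steps in the Figure 2 caption can be sketched in NumPy as follows. This is an illustrative reading of the caption under stated assumptions, not the paper's algorithm: the min-max formulas for $\vec{s}_c$ and $\vec{z}_c$, the reduction over batch and spatial axes, and the function names `channel_params_loop`, `channel_params_vectorized`, and `apply_prescaled` are all hypothetical. The point is only that a per-channel loop and a single vectorized reduction over the decomposed activation give the same parameters.

```python
import numpy as np


def channel_params_loop(acts, bits):
    """Reference version: visit channels one at a time."""
    levels = 2 ** bits - 1
    scales, zeros = [], []
    for c in range(acts.shape[1]):
        ch = acts[:, c]
        s = max(ch.max() - ch.min(), 1e-8) / levels
        scales.append(s)
        zeros.append(np.round(-ch.min() / s))
    return np.array(scales), np.array(zeros)


def channel_params_vectorized(acts, bits):
    """Decompose (N, C, H, W) into C row vectors and reduce them all at once."""
    levels = 2 ** bits - 1
    c = acts.shape[1]
    flat = acts.transpose(1, 0, 2, 3).reshape(c, -1)  # one row per channel
    mins = flat.min(axis=1)
    maxs = flat.max(axis=1)
    scales = np.maximum(maxs - mins, 1e-8) / levels
    zeros = np.round(-mins / scales)
    return scales, zeros


def apply_prescaled(acts, scales, zeros, bits):
    """Quantize, then apply the per-channel scale in one broadcast multiply."""
    levels = 2 ** bits - 1
    s = scales.reshape(1, -1, 1, 1)
    z = zeros.reshape(1, -1, 1, 1)
    q = np.clip(np.round(acts / s) + z, 0, levels)
    return (q - z) * s


rng = np.random.default_rng(1)
acts = rng.standard_normal((4, 6, 8, 8))  # assumed toy activation map

s_loop, z_loop = channel_params_loop(acts, bits=3)
s_vec, z_vec = channel_params_vectorized(acts, bits=3)
dequantized = apply_prescaled(acts, s_vec, z_vec, bits=3)
```

The broadcast multiply in `apply_prescaled` stands in for the "parallel lanes" of the caption: the scale vector is applied once to the whole tensor rather than inside a per-channel accumulation loop.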