Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation

Shile Li, Markus Karmann, Onay Urfalioglu

TL;DR

A data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as "a <adjective> photo of <adjective> <cls>".

Abstract

We present a framework for end-to-end joint quantization of Vision Transformers trained on ImageNet for image classification. Unlike prior post-training or block-wise reconstruction methods, we jointly optimize all layers and inter-block dependencies without any labeled data, scaling effectively with the number of samples and completing in just one hour on a single GPU for ViT-small. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet and, to the best of our knowledge, the first PTQ results that maintain strong accuracy on ViT, DeiT, and Swin-T models under extremely low-bit settings (W1.58A8), demonstrating the potential for efficient edge deployment. Furthermore, we introduce a data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts. By encouraging diversity in both the learned prompt embeddings and the generated image features, our data-free approach achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as "a <adjective> photo of <adjective> <cls>".
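
For readers unfamiliar with the bit-width labels: WxAy denotes x-bit weights and y-bit activations, and 1.58 bits corresponds to log2(3), i.e. ternary weights. The sketch below illustrates the notation with a generic symmetric, per-tensor quantize-dequantize step. It is not the paper's quantizer; the function name `fake_quantize`, the `q_max` parameter, and the max-abs scaling rule are assumptions made for illustration only.

```python
import math

import numpy as np


def fake_quantize(x, q_max):
    """Quantize-dequantize x onto the symmetric integer grid {-q_max, ..., q_max}.

    Illustrative assumption: a single per-tensor scale chosen from the largest
    absolute value. q_max=1 gives ternary values, q_max=7 a signed 4-bit grid.
    """
    x = np.asarray(x, dtype=np.float64)
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()
    scale = max_abs / q_max
    q = np.clip(np.round(x / scale), -q_max, q_max)
    return q * scale


weights = np.array([-0.9, -0.2, 0.05, 0.4, 1.0])
ternary = fake_quantize(weights, q_max=1)   # three levels, log2(3) bits each
four_bit = fake_quantize(weights, q_max=7)  # signed 4-bit symmetric grid
bits_per_ternary_weight = math.log2(3)
```

With three levels every weight collapses to one of {-s, 0, s} for a single scale s, which is why such extreme settings are usually hard to reach without some form of optimization over the quantized network.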

Paper Structure (22 sections, 16 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Our PTQ framework also supports a fully data-free mode using image synthesis via Stable Diffusion Turbo. Examples are shown for three ImageNet classes (kite, tench, and mountain bike). Top: images from simple text prompts show limited diversity and occasional semantic errors (e.g., kite as a toy rather than a bird). Bottom: our learned multi-mode prompts (4 of 20 per class shown) generate semantically correct samples that are diverse in layout, background, and style. These synthetic images are used for calibration in our data-free quantization pipeline.
  • Figure 2: Overview of the end-to-end quantization pipeline
  • Figure 3: Overview of the data-free multi-prompt learning pipeline. For each class (e.g., “tench”), multiple learned prompts are encoded by a frozen CLIP text encoder and used by Stable Diffusion Turbo to synthesize diverse images under shared latent noise. A frozen ViT classifier provides supervision via classification loss, while orthogonality and variance losses encourage semantic and visual diversity.
  • Figure 4: Accuracy vs. calibration size for ViT-S (W4A4): comparison with FIMA-Q.
  • Figure 5: Scaling trends across models and bit settings. Performance gains start to diminish beyond 10k calibration samples.
  • ...and 2 more figures
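
The Figure 3 caption mentions orthogonality and variance losses that encourage diversity among the learned prompts and the generated image features. The page does not give their exact form, so the following is only a sketch of one common way such terms are written: a penalty on pairwise cosine similarity between prompt embeddings, and a negated per-dimension variance over image features. The function names, the array shapes, and both formulas are assumptions, not the authors' definitions.

```python
import numpy as np


def orthogonality_penalty(prompt_embeddings):
    """Mean squared off-diagonal cosine similarity of K embeddings, shape (K, D).

    Assumed form: 0 when all prompts are mutually orthogonal, 1 when they are
    identical up to scale. Rows are assumed to be non-zero.
    """
    e = np.asarray(prompt_embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    gram = e @ e.T
    off_diagonal = gram[~np.eye(gram.shape[0], dtype=bool)]
    return float(np.mean(off_diagonal ** 2))


def variance_penalty(image_features):
    """Negated mean per-dimension variance of K feature vectors, shape (K, D).

    Assumed form: minimizing this value spreads the features apart.
    """
    f = np.asarray(image_features, dtype=np.float64)
    return float(-np.mean(np.var(f, axis=0)))


orthogonal_prompts = np.eye(3)       # three mutually orthogonal embeddings
collapsed_prompts = np.ones((3, 4))  # three identical embeddings
low_penalty = orthogonality_penalty(orthogonal_prompts)
high_penalty = orthogonality_penalty(collapsed_prompts)
```

In a pipeline like the one the caption describes, terms of this kind would be added to the classification loss from the frozen classifier, so that prompts stay on-class while being pushed away from each other.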