Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
Yunshan Zhong, Yuyao Zhou, Yuxin Zhang, Wanchen Sui, Shen Li, Yong Li, Fei Chao, Rongrong Ji
TL;DR
This work tackles data-free quantization (DFQ) for Vision Transformers (ViTs) by identifying two problems in synthetic data, semantic distortion and semantic inadequacy, and introducing SARDFQ to address them. The framework combines Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors; Multi-Semantic Reinforcement (MSR), which enriches image content via localized patch optimization; and Soft-Label Learning (SL), which supervises the resulting multi-semantic images with multiple semantic targets. Experiments on ImageNet across ViT, DeiT, and Swin backbones show substantial gains over prior DFQ methods, especially at low bit-widths such as W4A4. The method offers a practical, data-free route to ViT quantization; open directions include stronger theoretical grounding and narrowing the remaining gap to real-data baselines.
Abstract
Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. However, existing DFQ methods exhibit two limitations: (1) semantic distortion, where the semantics of synthetic images deviate substantially from those of real images, and (2) semantic inadequacy, where synthetic images contain extensive regions with limited content and oversimplified textures, leading to suboptimal quantization performance. To address these limitations, we propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs. To address semantic distortion, SARDFQ incorporates Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors. To mitigate semantic inadequacy, SARDFQ introduces Multi-Semantic Reinforcement (MSR), leveraging localized patch optimization to enhance semantic richness across synthetic images. Furthermore, SARDFQ employs Soft-Label Learning (SL), wherein multiple semantic targets are adapted to facilitate the learning of multi-semantic images augmented by MSR. Extensive experiments demonstrate the effectiveness of SARDFQ, significantly surpassing existing methods. For example, SARDFQ improves top-1 accuracy on ImageNet by 15.52% for W4A4 ViT-B. The code is at https://github.com/zysxmu/SARDFQ.
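For readers unfamiliar with the W4A4 notation, it denotes 4-bit weights and 4-bit activations. The following is a minimal sketch of generic asymmetric uniform quantization, not the quantizer used in SARDFQ; the function name `quantize_uniform` and the sample values are hypothetical and chosen only for illustration.

```python
def quantize_uniform(values, num_bits):
    """Generic asymmetric uniform fake-quantization (illustrative only).

    Maps each float to one of 2**num_bits evenly spaced levels spanning
    [min(values), max(values)] and returns the de-quantized floats.
    """
    levels = 2 ** num_bits - 1          # highest integer code, e.g. 15 for 4 bits
    lo, hi = min(values), max(values)
    if hi == lo:                        # constant input: nothing to quantize
        return list(values)
    scale = (hi - lo) / levels          # step size between adjacent levels
    out = []
    for v in values:
        q = round((v - lo) / scale)     # integer code
        q = max(0, min(levels, q))      # clamp to the representable range
        out.append(lo + q * scale)      # map back to a float
    return out


# Hypothetical weight values; with num_bits=4 at most 16 distinct levels survive.
weights = [-0.9, -0.3, 0.0, 0.4, 1.1]
w4 = quantize_uniform(weights, num_bits=4)
```

At 4 bits only 16 levels are available, so the rounding error per value can reach half a step. This is one reason low bit-width settings are sensitive to the quality of the calibration data, synthetic or real.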
