Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers

Yunshan Zhong, Yuyao Zhou, Yuxin Zhang, Wanchen Sui, Shen Li, Yong Li, Fei Chao, Rongrong Ji

TL;DR

This work tackles data-free quantization (DFQ) for Vision Transformers (ViTs) by identifying two defects of synthetic data, semantic distortion and semantic inadequacy, and introducing SARDFQ to address them. The framework combines Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors; Multi-Semantic Reinforcement (MSR), which enriches image content via localized patch optimization; and Soft-Label Learning (SL), which supplies multiple semantic targets for the multi-semantic images that MSR produces. Results on ImageNet across ViT, DeiT, and Swin backbones show substantial gains over prior DFQ methods, especially at low bit-widths (e.g., a 15.52% top-1 accuracy improvement for W4A4 ViT-B). These findings offer a practical, data-free route to ViT quantization, while leaving room for further theoretical grounding and for narrowing the remaining gap to real-data baselines.
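For readers unfamiliar with the notation, W4A4 denotes 4-bit weights and 4-bit activations. The snippet below is a minimal sketch of generic uniform min-max quantization, included only to illustrate what a 4-bit representation does to a tensor. The function name `fake_quantize` and the min-max calibration are assumptions made for illustration; the summary above does not specify which quantizer SARDFQ uses.

```python
def fake_quantize(values, num_bits):
    """Quantize and immediately dequantize a list of floats.

    Illustrative only: a uniform quantizer with min-max calibration
    (an assumption, not the quantizer described in the paper).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant tensor needs no quantization grid.
        return list(values)
    levels = 2 ** num_bits - 1          # 15 steps for 4 bits, i.e. 16 grid points
    scale = (hi - lo) / levels          # width of one quantization step
    # Snap each value to the nearest grid point, then map back to floats.
    return [round((v - lo) / scale) * scale + lo for v in values]


if __name__ == "__main__":
    weights = [0.0, 0.1, 0.5, 0.9, 1.0]
    print(fake_quantize(weights, num_bits=4))
```

With 4 bits, every value is mapped to one of at most 16 grid points, so the rounding error per value is bounded by half a step. This coarse grid is why low bit-width settings are the hardest case for data-free methods.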

Abstract

Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. However, existing DFQ methods exhibit two limitations: (1) semantic distortion, where the semantics of synthetic images deviate substantially from those of real images, and (2) semantic inadequacy, where synthetic images contain extensive regions with limited content and oversimplified textures, leading to suboptimal quantization performance. To address these limitations, we propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs. To address semantic distortion, SARDFQ incorporates Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors. To mitigate semantic inadequacy, SARDFQ introduces Multi-Semantic Reinforcement (MSR), leveraging localized patch optimization to enhance semantic richness across synthetic images. Furthermore, SARDFQ employs Soft-Label Learning (SL), wherein multiple semantic targets are adapted to facilitate the learning of multi-semantic images augmented by MSR. Extensive experiments demonstrate the effectiveness of SARDFQ, significantly surpassing existing methods. For example, SARDFQ improves top-1 accuracy on ImageNet by 15.52% for W4A4 ViT-B. The code is at https://github.com/zysxmu/SARDFQ.
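The abstract states that Soft-Label Learning supervises multi-semantic images with multiple semantic targets, but it does not give the loss. As a hedged illustration, the sketch below shows a generic soft-label cross-entropy in which a target distribution spreads probability mass over several classes. The helper names (`make_soft_label`, `soft_cross_entropy`) and the weighting scheme are hypothetical and are not taken from the SARDFQ code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_soft_label(num_classes, class_weights):
    """Build a target distribution from {class_index: weight}.

    Assumption: weights are positive and are simply normalized to sum to 1.
    How the paper actually assigns soft targets is not stated in the abstract.
    """
    total = sum(class_weights.values())
    label = [0.0] * num_classes
    for index, weight in class_weights.items():
        label[index] = weight / total
    return label


def soft_cross_entropy(logits, soft_label):
    """Cross-entropy between a soft target and the softmax of the logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(soft_label, probs) if t > 0)


if __name__ == "__main__":
    # A hypothetical image whose patches carry two semantics (classes 1 and 2).
    target = make_soft_label(num_classes=4, class_weights={1: 3.0, 2: 1.0})
    print(target, soft_cross_entropy([0.2, 2.0, 1.0, -1.0], target))
```

When the target is one-hot, this reduces to the ordinary cross-entropy, so a single-semantic image is a special case of the same loss.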

Paper Structure

This paper contains 21 sections, 15 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of (a) semantic distortion and (b) semantic inadequacy.
  • Figure 2: SARDFQ framework overview. Attention Priors Alignment (APA) employs randomly generated attention priors to improve semantic alignment. Multi-Semantic Reinforcement (MSR) optimizes different regions of synthetic images with various semantics to enhance overall semantic richness. Meanwhile, Soft-Label Learning (SL) adopts multiple semantic targets to ensure consistent learning of multi-semantic images augmented by MSR.
  • Figure 3: Comparison between attention maps.
  • Figure 4: Examples of generated attention priors.
  • Figure 5: Effect of varying (a) $\alpha_1$ and (b) $K_{MSR}$.