MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity

Kanghyun Choi, Hye Yoon Lee, Dain Kwon, SunJong Park, Kyuyeun Kim, Noseong Park, Jonghyun Choi, Jinho Lee

TL;DR

MimiQ is a novel DFQ method designed for ViTs that enhances inter-head attention similarity; it significantly outperforms baselines and sets a new state of the art for ViT-DFQ.

Abstract

Data-free quantization (DFQ) is a technique that creates a lightweight network from its full-precision counterpart without the original training data, often through a synthetic dataset. Although several DFQ methods have been proposed for vision transformer (ViT) architectures, they fail to achieve efficacy in low-bit settings. Examining the existing methods, we observe that their synthetic data produce misaligned attention maps, while those of the real samples are highly aligned. From this observation, we find that aligning attention maps of synthetic data helps improve the overall performance of quantized ViTs. Motivated by this finding, we devise MimiQ, a novel DFQ method designed for ViTs that enhances inter-head attention similarity. First, we generate synthetic data by aligning head-wise attention outputs from each spatial query patch. Then, we align the attention maps of the quantized network to those of the full-precision teacher by applying head-wise structural attention distillation. The experimental results show that the proposed method significantly outperforms baselines, setting a new state-of-the-art for ViT-DFQ. This paper is an extended version of our work published in the proceedings of AAAI 2025, including additional supplementary material.
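
To make the notion of inter-head attention similarity concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: it assumes each head's attention map for one query patch is given as a flat list of probabilities, and it uses a simplified single-window SSIM as the similarity measure. The function names `global_ssim` and `inter_head_similarity` are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def global_ssim(x, y, dynamic_range=1.0):
    """Single-window SSIM between two equally sized, flattened maps.

    Assumption: one global window instead of the usual sliding window,
    with the standard SSIM stabilizing constants for the given range.
    """
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mx, my = fmean(x), fmean(y)
    vx = fmean((a - mx) ** 2 for a in x)
    vy = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator


def inter_head_similarity(head_maps):
    """Mean pairwise similarity over all pairs of heads for one query patch."""
    return fmean(global_ssim(a, b) for a, b in combinations(head_maps, 2))


# Toy attention maps over four key patches, three heads each (made-up values).
aligned = [[0.7, 0.1, 0.1, 0.1]] * 3
misaligned = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]

print(inter_head_similarity(aligned))
print(inter_head_similarity(misaligned))
```

Heads that attend to the same key patch score higher than heads whose peaks fall on different patches, which is the property the abstract describes as aligned versus misaligned attention maps.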

Paper Structure (33 sections, 15 equations, 13 figures, 12 tables)

Figures (13)

  • Figure 1: Attention similarity histograms of Base DF Synthesis (high similarity, low similarity, and randomly sampled) and MimiQ. Colored boxes denote the ImageNet accuracy for the corresponding dataset. This motivational study shows that attention similarity is related to the DFQ accuracy of ViTs.
  • Figure 2: Attention visualization of synthetic samples and the original dataset. Compared to the original dataset, synthetic data from baselines present misaligned attention maps. The first row, a sample generated by MimiQ, shows aligned attention maps across attention heads. The measured average attention similarity (SSIM) further confirms that MimiQ achieves the best alignment.
  • Figure 3: An overview of the proposed method. (a) The synthetic samples are initialized with Gaussian noise, then optimized with $\mathcal{L}_G$. The proposed inter-head coherency loss $\mathcal{L}_{IHC}$ measures the similarity of head-wise attention maps from the same query patch index. (b) The quantized network is trained to minimize output ($\mathcal{L}_{KL}$) and inter-head ($\mathcal{L}_{HAD}$) discrepancy.
  • Figure 4: Sensitivity analysis of synthetic dataset size.
  • Figure 5: Grad-CAM and LPIPS analysis of MimiQ.
  • ...and 8 more figures
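
The training objective in Figure 3(b) combines an output term and a head-wise term. The sketch below shows one way such a combination could look; it is an illustration under stated assumptions, not the paper's formulation. The output term is taken to be a KL divergence between teacher and student softmax outputs. The head-wise term is assumed to be one minus a single-window SSIM between matching teacher and student attention maps, and `weight` is a hypothetical balancing factor. All function names are hypothetical.

```python
import math
from statistics import fmean


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def structural_gap(x, y, c1=1e-4, c2=9e-4):
    """1 - single-window SSIM between two flattened attention maps (assumed form)."""
    mx, my = fmean(x), fmean(y)
    vx = fmean((a - mx) ** 2 for a in x)
    vy = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )
    return 1.0 - ssim


def distillation_loss(teacher_logits, student_logits,
                      teacher_heads, student_heads, weight=1.0):
    """Output discrepancy plus weighted head-wise attention discrepancy."""
    l_kl = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    # Compare head h of the student with head h of the teacher.
    l_had = fmean(structural_gap(t, s)
                  for t, s in zip(teacher_heads, student_heads))
    return l_kl + weight * l_had


# Toy values: two heads, four key patches, three classes (made up).
teacher_heads = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]]
student_heads = [[0.4, 0.3, 0.2, 0.1], [0.3, 0.3, 0.2, 0.2]]
print(distillation_loss([2.0, 0.5, -1.0], [1.0, 0.8, -0.5],
                        teacher_heads, student_heads))
```

When the student reproduces the teacher's logits and attention maps, both terms vanish; any mismatch in either the output distribution or a head's attention map raises the loss.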