CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs

Akshat Ramachandran, Souvik Kundu, Tushar Krishna

TL;DR

Extensive evaluations across various vision tasks demonstrate the superiority of CLAMP-ViT, with performance improvements of up to 3% in top-1 accuracy for classification, 0.6 mAP for object detection, and 1.5 mIoU for segmentation at a similar or better compression ratio than existing alternatives.

Abstract

We present CLAMP-ViT, a data-free post-training quantization method for vision transformers (ViTs). We identify the limitations of recent techniques, notably their inability to leverage meaningful inter-patch relationships, which leads to the generation of simplistic and semantically vague data and in turn degrades quantization accuracy. CLAMP-ViT employs a two-stage approach, cyclically adapting between data generation and model quantization. Specifically, we incorporate a patch-level contrastive learning scheme to generate richer, semantically meaningful data. Furthermore, we leverage contrastive learning in layer-wise evolutionary search for fixed- and mixed-precision quantization to identify optimal quantization parameters while mitigating the effects of a non-smooth loss landscape. Extensive evaluations across various vision tasks demonstrate the superiority of CLAMP-ViT, with performance improvements of up to 3% in top-1 accuracy for classification, 0.6 mAP for object detection, and 1.5 mIoU for segmentation at a similar or better compression ratio than existing alternatives. Code is available at https://github.com/georgia-tech-synergy-lab/CLAMP-ViT.git
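
The abstract does not spell out the patch-level contrastive objective. As a rough illustration of the general idea only, the sketch below computes a generic InfoNCE-style loss for one anchor patch embedding: the loss is small when the anchor is more similar to its positive patches than to its negative patches. The function names `cosine` and `patch_info_nce`, the temperature `tau`, and the toy embeddings are assumptions made for this example; they are not taken from the paper or its code, and the paper's actual loss and patch-selection rule may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_info_nce(anchor, positives, negatives, tau=0.2):
    """Generic InfoNCE-style loss for a single anchor patch embedding.

    Hypothetical helper for illustration; not the paper's exact objective.
    """
    pos = [math.exp(cosine(anchor, p) / tau) for p in positives]
    neg = [math.exp(cosine(anchor, n) / tau) for n in negatives]
    denom = sum(pos) + sum(neg)
    # Average, over positives, of -log(exp(sim_pos / tau) / denom).
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Toy 2-D "patch embeddings" (made-up values).
anchor = [1.0, 0.0]
similar = [0.9, 0.1]
dissimilar = [0.0, 1.0]
opposite = [-1.0, 0.0]

# Positive is close to the anchor, negatives are far: low loss.
loss_aligned = patch_info_nce(anchor, [similar], [dissimilar, opposite])
# Positive is far from the anchor, one negative is close: high loss.
loss_misaligned = patch_info_nce(anchor, [dissimilar], [similar, opposite])
```

In a data-generation setting, a loss of this shape would be minimized with respect to the synthetic input so that related patches end up with similar embeddings; how positives and negatives are actually chosen is specific to the paper.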

Paper Structure (22 sections, 7 equations, 7 figures, 13 tables, 1 algorithm)

Figures (7)

  • Figure 1: Visualization of the loss landscape of (a) PSAQ-ViT v2 and (b) CLAMP-ViT on DeiT-S with perturbations to quantized model weights and synthetic data [li2018visualizing].
  • Figure 2: Overview of the cyclically evolving two-stage CLAMP-ViT framework. In stage 1 (① - ②), $\mathcal{L}^{SG}$ is minimized to update Gaussian noise towards synthesizing data. Stage 2 (③ - ⑤) conducts layer-wise evolutionary search to identify optimal quantization parameters while minimizing $\mathcal{L}^F$. Multiple instances of the models are drawn for clarity; only one instance of each model is used in the framework.
  • Figure 3: Intuitive visualization of positive and negative patch selection in Stage 1.
  • Figure 4: Comparison of synthetic data generated by (a) PSAQ-ViT v1 [li2022patch], (b) PSAQ-ViT v2 [li2023psaq] and (c) CLAMP-ViT (Ours). CLAMP-ViT generates detailed objects within contextually suitable backgrounds, boosting realism and informativeness.
  • Figure 5: CLAMP-ViT ablations for (a) Selecting evolutionary search parameters, (b) Mixed-precision quantization accuracy with different fitness functions, (c) Effect of batch size $\mathcal{B}$ and (d) Effect of neighborhood size $\mathcal{N}$ and top-$n$ positive patches.
  • ...and 2 more figures