D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models

Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka

TL;DR

This work addresses the challenge of quantizing CLIP models without real calibration data, identifying semantic insufficiency and low intra-image diversity in prior data-free approaches. It introduces D4C, a CLIP-tailored DFQ framework that combines Prompt-Guided Semantic Injection, Structural Contrastive Generation, and Perturbation-Aware Enhancement to synthesize semantically meaningful and structurally diverse calibration samples, followed by an optimization-guided PTQ stage. Empirical results across CNN- and ViT-based CLIP encoders on CIFAR-10/100 and ImageNet-1K show substantial accuracy gains over baselines and prior DFQ methods, establishing state-of-the-art performance under various bit-widths. The method enables privacy-preserving deployment of CLIP models with practical improvements in zero-shot classification tasks and demonstrates scalable training costs and storage savings, positioning D4C as a robust baseline for CLIP data-free quantization.

Abstract

Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
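
To make the W4A8 notation concrete (4-bit weights, 8-bit activations), the following is a minimal sketch of symmetric, per-tensor uniform fake quantization. It is an illustration under stated assumptions, not the quantizer used in the paper: the function name `fake_quantize`, the per-tensor max-abs scale, and the random tensors standing in for weights and calibration activations are all hypothetical.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Quantize x to signed num_bits integers, then map back to floats.

    Assumption: symmetric per-tensor scaling by the maximum absolute value.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4 bits, 127 for 8 bits
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer grid
    return q * scale, scale

rng = np.random.default_rng(0)
# Hypothetical stand-ins: one layer's weights and a batch of activations.
weights = rng.normal(size=(16, 32)).astype(np.float32)
activations = rng.normal(size=(8, 32)).astype(np.float32)

w_q, w_scale = fake_quantize(weights, num_bits=4)       # "W4"
a_q, a_scale = fake_quantize(activations, num_bits=8)   # "A8"

# Mean squared quantization error for each tensor.
w_err = float(np.mean((weights - w_q) ** 2))
a_err = float(np.mean((activations - a_q) ** 2))
print(f"W4 MSE: {w_err:.6f}  A8 MSE: {a_err:.6f}")
```

In this sketch the activation scale is derived from whatever samples are passed in, which is why calibration data matters: in a data-free setting, synthesized images would take the place of the random `activations` tensor.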

Paper Structure

This paper contains 27 sections, 6 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: UMAP visualization of features across various samples. Images generated using BNS or PSE losses are distant from real images, indicating limited semantic information. In contrast, D4C-generated samples closely match real data.
  • Figure 2: Patch similarity visualization across different images. Compared to Gaussian noise and BNS/PSE-based synthetic images, which exhibit weak or irregular internal patch relationships, D4C-generated images display structured similarity patterns closely resembling those of real images.
  • Figure 3: Overview of D4C: PGSI injects semantic information into synthetic samples through object concept prompting; SCG enhances structural diversity via foreground-background contrastive generation; and PAE introduces perturbations to further improve sample quality and expressiveness.
  • Figure 4: Visualization of synthetic samples generated by BNS and PSE (left) versus our proposed D4C framework (right) under both RN50 and VB32 encoders.
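
The patch similarity map described in the Figure 2 caption can be approximated with a short sketch. The version below computes pairwise cosine similarity between non-overlapping raw pixel patches. This is an assumption made for illustration: the paper may well compute similarity on encoder features instead, and the function name `patch_similarity` and the patch size are hypothetical.

```python
import numpy as np

def patch_similarity(image, patch):
    """Pairwise cosine similarity between non-overlapping patches.

    image: array of shape (H, W, C); patch: side length in pixels.
    Returns an (N, N) matrix, where N is the number of patches.
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Cut the image into a (gh x gw) grid and flatten each patch to a vector.
    grid = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    vecs = grid.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    # Center each patch so that overall brightness does not dominate.
    vecs = vecs - vecs.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / np.maximum(norms, 1e-12)   # constant patches become zero vectors
    return unit @ unit.T

rng = np.random.default_rng(0)
noise = rng.normal(size=(32, 32, 3))         # Gaussian-noise "image"
sim = patch_similarity(noise, patch=8)       # 16 patches, so a 16 x 16 map

off_diag = sim[~np.eye(sim.shape[0], dtype=bool)]
print(sim.shape, float(np.mean(np.abs(off_diag))))
```

For Gaussian noise the off-diagonal entries stay close to zero, consistent with the weak internal patch relationships the caption attributes to noise inputs; an image with repeated or coherent regions would show larger structured blocks in the same matrix.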