CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

Kaicheng Yang, Tiancheng Gu, Xiang An, Haiqiang Jiang, Xiangzi Dai, Ziyong Feng, Weidong Cai, Jiankang Deng

TL;DR

CLIP-CID addresses the data and compute cost of vision-language pre-training by distilling a large teacher model into a smaller student. It combines three components: an image semantic balance step that filters semantically redundant pairs (reducing LAION400M to LAION225M), cluster-instance discrimination that captures semantic structure beyond instance-level signals, and instance-level distillation that improves cross-modal alignment. The approach yields state-of-the-art or competitive linear probe and zero-shot performance across 14 downstream datasets while using substantially less data and compute. This semantic-aware distillation framework promises practical efficiency gains for deploying vision-language models in resource-constrained settings.
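To make the idea of cluster-level supervision concrete, here is a minimal NumPy sketch of generic cluster discrimination: teacher embeddings are assigned to their nearest centroid to produce pseudo-labels, and the student is scored with a softmax cross-entropy over centroid similarities. This is an illustration only, not the paper's exact loss. The function names, the `temperature` value, and the toy shapes are all assumptions introduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def assign_clusters(teacher_emb, centroids):
    # Pseudo-label = index of the most similar centroid (hypothetical helper).
    sims = l2_normalize(teacher_emb) @ l2_normalize(centroids).T
    return sims.argmax(axis=1)

def cluster_discrimination_loss(student_emb, centroids, labels, temperature=0.07):
    # Cross-entropy of student-to-centroid similarities against pseudo-labels.
    logits = l2_normalize(student_emb) @ l2_normalize(centroids).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy data: 8 samples, 4-dim embeddings, 3 clusters (all values are made up).
rng = np.random.default_rng(0)
teacher_emb = rng.normal(size=(8, 4))
centroids = rng.normal(size=(3, 4))

labels = assign_clusters(teacher_emb, centroids)
loss_aligned = cluster_discrimination_loss(teacher_emb, centroids, labels)
loss_flipped = cluster_discrimination_loss(-teacher_emb, centroids, labels)
```

A student whose embeddings match the teacher's scores a lower loss than one whose embeddings point the opposite way, which is the signal a cluster-level objective of this kind provides in addition to per-instance matching.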

Abstract

Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over a wide range of tasks. However, the effectiveness of CLIP relies heavily on a substantial corpus of pre-training data, resulting in notable consumption of computational resources. Although knowledge distillation has been widely applied to single-modality models, how to efficiently extend it to vision-language foundation models trained on extensive data remains relatively unexplored. In this paper, we introduce CLIP-CID, a novel distillation mechanism that effectively transfers knowledge from a large vision-language foundation model to a smaller model. We first propose a simple but efficient image semantic balance method to reduce transfer learning bias and improve distillation efficiency. This method filters out 43.7% of the image-text pairs in LAION400M while maintaining superior performance. We then leverage cluster-instance discrimination to facilitate knowledge transfer from the teacher model to the student model, enabling the student to acquire a holistic semantic understanding of the pre-training data. Experimental results demonstrate that CLIP-CID achieves state-of-the-art performance on various downstream tasks, including linear probe and zero-shot classification.
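As a quick consistency check, the 43.7% filtering rate lines up with the LAION225M name used for the filtered dataset. The sketch below assumes the nominal 400M pair count implied by the dataset name; the number of pairs actually downloaded may differ.

```python
def remaining_pairs(total_pairs, filtered_fraction):
    # Pairs left after removing the given fraction of the dataset.
    return total_pairs * (1.0 - filtered_fraction)

LAION400M_NOMINAL = 400_000_000  # assumption: nominal size taken from the name
FILTERED_FRACTION = 0.437        # from the abstract

kept = remaining_pairs(LAION400M_NOMINAL, FILTERED_FRACTION)
print(f"{kept / 1e6:.1f}M pairs kept")
```

Under that assumption, roughly 225M pairs remain, matching the LAION225M label.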

Paper Structure

This paper contains 25 sections, 7 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Linear probe performance comparison between CLIP-CID and OpenCLIP across 14 common datasets. Despite the exclusion of 43.7% of the image-text pairs in LAION400M, CLIP-CID exhibits exceptional performance.
  • Figure 2: (a) and (b) Visualization of perceptually redundant and semantically redundant images. (c) Visualization of the image semantic balance process. (d) Distribution of LAION400M and LAION225M over 1M clusters.
  • Figure 3: The architecture of our proposed cluster-instance discrimination distillation.
  • Figure 4: Weight distribution of the last fully connected layer in the middle and last transformer layers.
  • Figure 5: Visualization of PCA components. We extract three principal components from the collected patch features of each image. The principal components are then visualized using separate color channels. Similar colors within patches indicate semantic similarities. We use $\textcolor{red}{\circ}$ to accentuate the primary distinction.
  • ...and 3 more figures