CLIP-KD: An Empirical Study of CLIP Model Distillation

Chuanguang Yang; Zhulin An; Libo Huang; Junyu Bi; Xinqiang Yu; Han Yang; Boyu Diao; Yongjun Xu

TL;DR

CLIP-KD tackles the challenge of improving small CLIP models by distilling knowledge from a large teacher using a set of distillation strategies. The authors show that a simple feature distillation approach is highly effective, with interactive contrastive learning providing additional gains, and that maximizing teacher–student feature similarity explains performance improvements. The unified approach is validated across multiple teacher–student pairs and datasets, delivering consistent improvements in zero-shot ImageNet and cross-modal retrieval tasks. The work offers practical CLIP compression guidelines and demonstrates that architecture-agnostic distillation can bridge the gap between small models and large teachers, with released code for replication.

Abstract

Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that simple feature mimicry with a Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders is also effective in improving performance. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD consistently improves student CLIP models on zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher, CLIP-KD achieves 57.5% and 55.4% zero-shot top-1 ImageNet accuracy with ViT-B/16 and ResNet-50 students, surpassing the original CLIP without KD by 20.5% and 20.1% margins, respectively. Our code is available at https://github.com/winycg/CLIP-KD.
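The two strategies the abstract singles out, feature mimicry with a Mean Squared Error loss and interactive contrastive learning across teacher and student encoders, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the released repository for that). The function names, the toy 2-D features, and the temperature of 0.07 are hypothetical choices made for illustration, teacher and student features are assumed to share one dimension, and the pairing used in `interactive_contrastive` (student image features against teacher text features, and student text features against teacher image features) is one plausible reading of "interactive".

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def feature_mse(student_feats, teacher_feats):
    """Mean squared error between matching student and teacher feature vectors."""
    total, count = 0.0, 0
    for s, t in zip(student_feats, teacher_feats):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def mean_cosine(student_feats, teacher_feats):
    """Average cosine similarity between matching student and teacher features."""
    sims = []
    for s, t in zip(student_feats, teacher_feats):
        s_n, t_n = l2_normalize(s), l2_normalize(t)
        sims.append(sum(a * b for a, b in zip(s_n, t_n)))
    return sum(sims) / len(sims)


def info_nce(anchors, targets, temperature=0.07):
    """Contrastive cross-entropy in which anchors[k] should match targets[k]."""
    anchors = [l2_normalize(a) for a in anchors]
    targets = [l2_normalize(t) for t in targets]
    loss = 0.0
    for k, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denom - logits[k]
    return loss / len(anchors)


def interactive_contrastive(student_img, student_txt, teacher_img, teacher_txt,
                            temperature=0.07):
    """Assumed pairing: student anchors are contrasted against teacher features
    of the other modality."""
    return (info_nce(student_img, teacher_txt, temperature)
            + info_nce(student_txt, teacher_img, temperature))


# Toy batch of two image-text pairs with hypothetical 2-D features.
teacher_img = [[1.0, 0.0], [0.0, 1.0]]
teacher_txt = [[1.0, 0.0], [0.0, 1.0]]
student_img = [[0.9, 0.1], [0.2, 0.8]]
student_txt = [[0.8, 0.2], [0.1, 0.9]]

print("feature MSE (image):", feature_mse(student_img, teacher_img))
print("mean cosine (image):", mean_cosine(student_img, teacher_img))
print("interactive contrastive:",
      interactive_contrastive(student_img, student_txt, teacher_img, teacher_txt))
```

In this toy setting, driving the MSE term toward zero also pushes the teacher-student cosine similarity toward 1, which is in line with the abstract's explanation that the gains come from maximizing feature similarity between teacher and student.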

Paper Structure (25 sections, 32 equations, 5 figures, 12 tables)

Introduction
Related Works
Methodology
A Brief Review of CLIP
CLIP Knowledge Distillation
Contrastive Relational Distillation
Feature Distillation
Masked Feature Distillation
Gradient Distillation
Interactive Contrastive Learning
Augmented Feature Distillation
Overall Loss of CLIP Distillation
Experiments
Experimental Setup
Ablation Study of Distillation Losses
...and 10 more sections

Figures (5)

Figure 1: Illustration of various CLIP knowledge distillation approaches proposed in this paper.
Figure 2: Training curves trained on CC3M+12M for CLIP-KD.
Figure 3: Similarity statistics between teacher and student features after distillation trained on CC3M+12M. $v_{k}^{\mathbf{T}}$ and $v_{k}^{\mathbf{S}}$ denote the teacher and student image features, respectively. $s_{k}^{\mathbf{T}}$ and $s_{k}^{\mathbf{S}}$ denote the teacher and student text features, respectively.
Figure 4: Training curves using ViT-B/16 as the teacher and ViT-T/16 as the student for CLIP-KD compared to the baseline.
Figure 5: Top-1 accuracy on zero-shot ImageNet using intermediate feature distillation trained from CC3M+12M.
