CLIP-Embed-KD: Computationally Efficient Knowledge Distillation Using Embeddings as Teachers

Lakshmi Nair

TL;DR

Preliminary findings show that CLIP-based knowledge distillation with embeddings can outperform full-scale knowledge distillation using $9\times$ less memory and $8\times$ less training time.

Abstract

Contrastive Language-Image Pre-training (CLIP) has been shown to improve zero-shot generalization capabilities of language and vision models. In this paper, we extend CLIP for efficient knowledge distillation by utilizing embeddings as teachers. Typical knowledge distillation frameworks require running forward passes through a teacher model, which is often prohibitive in the case of billion- or trillion-parameter teachers. In these cases, using only the embeddings of the teacher models to guide the distillation can yield significant computational savings. Our preliminary findings show that CLIP-based knowledge distillation with embeddings can outperform full-scale knowledge distillation using $9\times$ less memory and $8\times$ less training time. Code is available at: https://github.com/lnairGT/CLIP-Distillation/
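The idea of using cached embeddings in place of teacher forward passes can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names `precompute_teacher_embeddings` and `embed_kd_loss`, the per-class averaging of teacher embeddings, the temperature of 0.07, and the toy linear "teacher" are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def precompute_teacher_embeddings(teacher_fn, samples, labels, num_classes):
    """Run the teacher once and cache one mean embedding per class.

    Assumption: one averaged embedding per class; every class must have
    at least one sample, otherwise the mean is undefined.
    """
    emb = teacher_fn(samples)  # shape (N, D)
    cache = np.stack([emb[labels == c].mean(axis=0) for c in range(num_classes)])
    return l2_normalize(cache)  # shape (C, D)

def embed_kd_loss(student_emb, labels, teacher_cache, temperature=0.07):
    """CLIP-style loss: cross-entropy over cosine similarities between
    student embeddings and the cached teacher embeddings."""
    s = l2_normalize(student_emb)
    logits = s @ teacher_cache.T / temperature            # shape (B, C)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy usage: a fixed random projection stands in for the teacher model.
rng = np.random.default_rng(0)
W_teacher = rng.normal(size=(16, 8))
teacher_fn = lambda x: x @ W_teacher

samples = rng.normal(size=(40, 16))
labels = np.arange(40) % 4

# The teacher is called once here; training steps only read the cache.
cache = precompute_teacher_embeddings(teacher_fn, samples, labels, num_classes=4)

random_student = rng.normal(size=(40, 8))
random_loss = embed_kd_loss(random_student, labels, cache)
aligned_loss = embed_kd_loss(cache[labels], labels, cache)
```

In this sketch the teacher is only invoked inside `precompute_teacher_embeddings`, so the per-step cost during student training is a single matrix product against a small `(C, D)` cache rather than a teacher forward pass.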

Paper Structure (9 sections, 2 equations, 5 figures, 3 tables)


Introduction
Approach
Knowledge distillation
CLIP-Teacher-KD
CLIP-Embed-KD
Experiments and Results
CLIP distillation loss vs. regular KD loss
CLIP-Teacher-KD vs. CLIP-Embed-KD
Future Work

Figures (5)

Figure 1: CLIP-Teacher-KD uses the teacher model to generate embeddings for every training sample. CLIP-Embed-KD uses pre-computed teacher embeddings, thus avoiding the need to run forward passes through the teacher model for every sample.
Figure 2: Training and validation curves of base student model.
Figure 3: Training and validation curves of large student model.
Figure 4: Training resource utilization (with Large-32 teacher) of CLIP-Embed-KD and CLIP-Teacher-KD [student-size][image-size]: CLIP-Embed-KD scales well (to larger models and larger images) and outperforms CLIP-Teacher-KD while using much less memory.
Figure 5: Accuracy vs. $\alpha_2$ ($\alpha_1 = 1 - \alpha_2$). Non-zero weighting of both losses yields the best performance ($\alpha_2 = 1$ causes overfitting).
