CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

Jeannie Chung, Hanna Jang, Ingyeong Yang, Uiwon Hwang, Jaehyeong Sim

Abstract

CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8 percentage points.
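
The abstract describes VRD and XRD only at a high level, and the exact losses live in the body of the paper. As a rough illustration only, the sketch below shows one plausible reading in PyTorch: `vrd_loss` matches the batch-level distribution of per-sample teacher-student agreement across the image and text modalities, and `xrd_loss` symmetrizes the two cross-modal teacher-student similarity distributions with a KL term. Every function name, the temperature `tau`, and the KL formulations are assumptions for this sketch, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def vrd_loss(t_img, t_txt, s_img, s_txt):
    """One plausible reading of Vertical Relational Distillation (VRD):
    per-sample teacher-student agreement should be distributed consistently
    across the image and text modalities. Illustrative assumption only."""
    # Per-sample cosine similarity between teacher and student embeddings
    img_agree = F.cosine_similarity(t_img, s_img, dim=-1)  # shape (B,)
    txt_agree = F.cosine_similarity(t_txt, s_txt, dim=-1)  # shape (B,)
    # Compare the two agreement profiles at the distribution level
    p = F.log_softmax(img_agree, dim=0)
    q = F.log_softmax(txt_agree, dim=0)
    # Symmetrized KL between the two distributions over the batch
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="sum")
                  + F.kl_div(q, p, log_target=True, reduction="sum"))

def xrd_loss(t_img, t_txt, s_img, s_txt, tau=0.07):
    """One plausible reading of Cross Relational Distillation (XRD):
    the teacher-image/student-text and student-image/teacher-text
    similarity distributions should be symmetric. Illustrative only."""
    # L2-normalize embeddings, as in CLIP
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)
    # Cross-model similarity logits over the batch
    ts = t_img @ s_txt.t() / tau  # teacher images vs. student texts, (B, B)
    st = s_img @ t_txt.t() / tau  # student images vs. teacher texts, (B, B)
    # Symmetrized row-wise KL between the two cross-modal distributions
    p, q = F.log_softmax(ts, dim=-1), F.log_softmax(st, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))
```

In a full training loop, terms like these would presumably be weighted and added to the student's standard CLIP contrastive loss; the paper's actual formulation should be taken from its method section.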

Paper Structure

This paper contains 33 sections, 41 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Overview of CLIP-RD.
  • Figure 2: Training loss. The y-axis is clipped at 4.0 to emphasize relative trends after the initial warm-up phase.
  • Figure 3: Accuracy on ImageNet (IN) and R@1 on the CC3M validation set.
  • Figure 4: Positive and negative pair similarity and CC3M validation set retrieval performance with CLIP-KD and CLIP-RD.
  • Figure 5: Positive and negative pair similarity distributions with CLIP-KD and CLIP-RD. As in Figure 4, higher positive (a) and lower negative (b) similarity indicate better alignment in the representation space.
  • ...and 1 more figure