Enriching Knowledge Distillation with Cross-Modal Teacher Fusion

Amir M. Mansourian, Amir Mohammad Babaei, Shohreh Kasaei

TL;DR

Vanilla knowledge distillation often relies on unimodal visual cues, which limits generalization. RichKD fuses a task-specific teacher with CLIP to inject cross-modal semantic guidance through multi-prompt logits and feature fusion, formalized as $z_F(x) = \alpha z_T(x) + (1-\alpha) z_C(x)$ and $f_F(x) = \lambda f_T(x) + (1-\lambda) f_C(x)$, where the subscripts $T$, $C$, and $F$ denote the teacher, CLIP, and the fused supervision, and $\alpha$ and $\lambda$ are mixing weights. The approach is theoretically motivated by a bias-variance view and empirically validated on CIFAR-100 and ImageNet, where it improves accuracy and robustness and remains compatible with other distillation losses. The work demonstrates that vision-language models can provide semantically grounded supervision that improves the generalization of distilled students and their resilience across datasets and corruptions.
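
The two fusion formulas are plain weighted averages, which a small sketch can make concrete. The snippet below is an illustration only, not the authors' implementation: the helper names `fuse` and `softmax`, the toy logits and features, and the chosen values of $\alpha$ and $\lambda$ are all assumptions, and the weights are assumed to lie in $[0, 1]$. It also assumes the teacher and CLIP features already share a dimension.

```python
import math


def fuse(teacher, clip, weight):
    """Return weight * teacher + (1 - weight) * clip, element by element."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight is assumed to lie in [0, 1]")
    if len(teacher) != len(clip):
        raise ValueError("teacher and CLIP vectors must have the same length")
    return [weight * t + (1.0 - weight) * c for t, c in zip(teacher, clip)]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up 3-class example: the teacher is unsure between classes 0 and 1,
# while CLIP (hypothetical logits) clearly prefers class 0.
z_T = [2.0, 1.9, -1.0]
z_C = [4.0, 0.0, 0.0]
alpha = 0.5  # assumed logit mixing weight

z_F = fuse(z_T, z_C, alpha)  # z_F(x) = alpha * z_T(x) + (1 - alpha) * z_C(x)
p_T = softmax(z_T)
p_F = softmax(z_F)

# Feature fusion uses the same form with its own weight lambda.
f_T = [0.2, -0.4]
f_C = [1.0, 0.0]
lam = 0.75  # assumed feature mixing weight
f_F = fuse(f_T, f_C, lam)  # f_F(x) = lambda * f_T(x) + (1 - lambda) * f_C(x)

print("fused logits:", z_F)
print("teacher P(class 0):", round(p_T[0], 3))
print("fused   P(class 0):", round(p_F[0], 3))
print("fused features:", f_F)
```

In this toy case the fused distribution puts more mass on class 0 than the teacher alone does, which mirrors the "correct but uncertain" scenario the paper discusses. Setting `alpha = 1.0` recovers the teacher's logits unchanged.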

Abstract

Multi-teacher knowledge distillation (KD), a more effective technique than traditional single-teacher methods, transfers knowledge from expert teachers to a compact student model using logit or feature matching. However, most existing approaches lack knowledge diversity, as they rely solely on unimodal visual information, overlooking the potential of cross-modal representations. In this work, we explore the use of CLIP's vision-language knowledge as a complementary source of supervision for KD, an area that remains largely underexplored. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. By incorporating CLIP's multi-prompt textual guidance, the fused supervision captures both dataset-specific and semantically enriched visual cues. Beyond accuracy, analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones. Moreover, fusion with CLIP refines the entire logit distribution, producing semantically meaningful probabilities for non-target classes, thereby improving inter-class consistency and distillation quality. Despite its simplicity, the proposed method, Enriching Knowledge Distillation (RichKD), consistently outperforms most existing baselines across multiple benchmarks and exhibits stronger robustness under distribution shifts and input corruptions.

Paper Structure

This paper contains 17 sections, 12 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Impact of cross-modal teacher fusion on CIFAR-100. (a) Effect of perturbing the logits of the conventional teacher with CLIP’s logits across four categories, considering cases where the teacher’s predictions are correct/incorrect and certain/uncertain. (b) Effect of fusion with CLIP for two sample cases: when the teacher is incorrect, and when the teacher is correct but uncertain.
  • Figure 2: Overall diagram of the proposed RichKD distillation method. CLIP’s logits and features are fused with those from the conventional teacher model. Feature and logit distillation losses are then defined between the fused representations and the student’s corresponding features and logits. During training, the parameters of CLIP and the teacher model are frozen, and the student is trained with the feature and logit distillation losses in addition to the cross-entropy loss. Mismatched feature dimensions are reconciled with a linear layer.
  • Figure 3: Impact of different types of prompting on CLIP’s zero-shot performance and the student’s performance. The teacher and student architectures are ResNet-32×4 and ResNet-8×4, respectively.
  • Figure 4: t-SNE visualization of features.
  • Figure 5: Inter-class correlation matrices on the CIFAR-100 dataset.
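
The training setup described in the Figure 2 caption (cross-entropy plus logit and feature distillation against the fused targets, with a linear layer bridging feature dimensions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's objective: the temperature-scaled KL divergence, the mean-squared feature loss, the loss weights `beta` and `gamma`, the choice to project the student's features (rather than the fused ones), and all function names and numbers are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Standard cross-entropy of the student against the ground-truth label."""
    return -math.log(softmax(student_logits)[label])


def kd_kl(fused_logits, student_logits, temperature):
    """Assumed logit loss: T^2 * KL(softmax(fused / T) || softmax(student / T))."""
    p = softmax(fused_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


def linear(features, weight, bias):
    """Linear layer y = W x + b, used here to match feature dimensions."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weight, bias)]


def mse(a, b):
    """Assumed feature loss: mean squared error between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(student_logits, student_feat, fused_logits, fused_feat,
               label, weight, bias, temperature=4.0, beta=1.0, gamma=1.0):
    """Cross-entropy + beta * logit distillation + gamma * feature distillation.

    The fused targets come from the frozen teacher and CLIP, so they are
    treated as constants; only the student (and the projection) would be trained.
    """
    projected = linear(student_feat, weight, bias)
    return (cross_entropy(student_logits, label)
            + beta * kd_kl(fused_logits, student_logits, temperature)
            + gamma * mse(projected, fused_feat))


# Made-up example: 3 classes, 3-d student features, 2-d fused features.
fused_logits = [3.0, 0.95, -0.5]
fused_feat = [0.4, -0.3]
student_logits = [1.0, 0.5, 0.0]
student_feat = [0.5, -0.5, 1.0]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical 2x3 projection
b = [0.0, 0.0]

loss = total_loss(student_logits, student_feat, fused_logits, fused_feat,
                  label=0, weight=W, bias=b)
print("total loss:", loss)
```

In a real training loop these terms would be computed on batched tensors in an autodiff framework, with gradients flowing only into the student and the projection layer.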