Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification

Chenqi Guo, Mengshuo Rong, Qianli Feng, Rongfan Feng, Yinglong Ma

TL;DR

The paper improves unimodal image classification through crossmodal knowledge distillation with a two-teacher framework: a unimodal image teacher and a multimodal CLIP-based teacher augmented with WordNet-relaxed text embeddings. A hierarchical loss and a cosine regularization term align the relaxed text prompts with the true class semantics and keep them from drifting away from their pretrained references; using relaxed prompts rather than exact class names mitigates label leakage. Across six public datasets the method consistently improves the student, reaching state-of-the-art or near state-of-the-art performance, and interpretability analyses indicate reduced reliance on textual shortcuts and stronger use of visual features. The results show that richer, semantically grounded text prompts make crossmodal KD more robust while the student remains unimodal at inference.
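
The drift-preventing cosine regularization can be pictured with a minimal sketch. It assumes the penalty is the mean of 1 - cos(learned, reference) over classes; this summary does not give the exact form used in the paper, and the function names `cosine_similarity` and `cosine_drift_penalty` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_drift_penalty(learned, reference):
    """Mean of (1 - cosine similarity) across classes.

    Assumed form of the regularizer: it is zero when every learnable text
    embedding still points in the direction of its pretrained reference.
    """
    gaps = [1.0 - cosine_similarity(l, r) for l, r in zip(learned, reference)]
    return sum(gaps) / len(gaps)


# Toy 2-D embeddings for two classes (illustrative values only).
reference = [[1.0, 0.0], [0.0, 1.0]]
unchanged = [[2.0, 0.0], [0.0, 3.0]]  # rescaled, same direction as reference
drifted = [[0.0, 1.0], [0.0, 1.0]]    # first class rotated by 90 degrees

print(cosine_drift_penalty(unchanged, reference))  # 0.0
print(cosine_drift_penalty(drifted, reference))    # 0.5
```

Because the penalty depends only on direction, rescaling an embedding costs nothing, while rotating it away from the pretrained reference is penalized.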

Abstract

Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve knowledge transfer. In supervised image classification, image datasets typically include class labels that represent high-level concepts, suggesting a natural avenue to incorporate textual cues for crossmodal KD. However, these labels rarely capture the deeper semantic structures in real-world visuals and can lead to label leakage if used directly as inputs, ultimately limiting KD performance. To address these issues, we propose a multi-teacher crossmodal KD framework that integrates CLIP image embeddings with learnable WordNet-relaxed text embeddings under a hierarchical loss. By avoiding direct use of exact class names and instead using semantically richer WordNet expansions, we mitigate label leakage and introduce more diverse textual cues. Experiments show that this strategy significantly boosts student performance, whereas noisy or overly precise text embeddings hinder distillation efficiency. Interpretability analyses confirm that WordNet-relaxed prompts encourage heavier reliance on visual features over textual shortcuts, while still effectively incorporating the newly introduced textual cues. Our method achieves state-of-the-art or second-best results on six public datasets, demonstrating its effectiveness in advancing crossmodal KD.
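
One plausible reading of "WordNet-relaxed" is sketched below: the exact class name is replaced by a broader concept before the text prompt is built. The sketch assumes relaxation by hypernym substitution and a "a photo of a ..." prompt template, neither of which is confirmed by the abstract. The hierarchy `TOY_HYPERNYMS` is written by hand for illustration and is not read from WordNet; `relax_label` and `build_prompt` are hypothetical helpers.

```python
# Hand-written toy hierarchy for illustration only; NOT loaded from WordNet.
TOY_HYPERNYMS = {
    "beagle": "hound",
    "hound": "dog",
    "dog": "animal",
    "oak": "tree",
    "tree": "plant",
}


def relax_label(label, hypernyms, levels=1):
    """Climb up to `levels` steps in the hierarchy, stopping at the root."""
    term = label
    for _ in range(levels):
        if term not in hypernyms:
            break  # no broader concept known; keep the current term
        term = hypernyms[term]
    return term


def build_prompt(label, hypernyms, levels=1):
    """Build a prompt that names a broader concept, not the exact class."""
    return f"a photo of a {relax_label(label, hypernyms, levels)}"


print(build_prompt("beagle", TOY_HYPERNYMS))            # a photo of a hound
print(build_prompt("beagle", TOY_HYPERNYMS, levels=2))  # a photo of a dog
```

The prompt no longer contains the class name itself, so a teacher that reads it cannot simply copy the label, yet the text stays semantically related to the image.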

Paper Structure

This paper contains 21 sections, 6 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison of different KD frameworks for supervised image classification. (a) Vanilla KD: Teacher and student share the same modality (visual), providing limited training cues. (b) Conventional crossmodal KD: Knowledge is transferred from one modality (e.g., text) in the teacher to another modality (e.g., image) in the student, but suffers from modality gaps and insufficient general modality features in the teacher. (c) Our multi-teacher crossmodal KD with WordNet-relaxation: Combines both image and CLIP-based multimodal embeddings in the teacher, providing richer information to the student.
  • Figure 2: Teacher $\text{T}_x$ and Student $\text{S}_s$ top-1 validation accuracy under KD on CIFAR100, across varying proportions of CLIP WordNet-relaxed text embeddings. Note that $\text{T}_{x}$ uses only CLIP text embeddings as inputs here. The proportion is the fraction of training samples that use WordNet-relaxed text embeddings rather than ground-truth class-name-based ones. As the proportion increases, the teacher’s validation accuracy decreases, indicating that the classification task becomes more challenging. In contrast, the student's performance improves with a higher proportion of WordNet-relaxed text embeddings, highlighting the regularization benefits of incorporating more diverse semantic cues.
  • Figure 3: Captum feature attribution (text vs. image contributions) for teacher $\text{T}_x$ on CIFAR100, with varying proportions of CLIP noisy text embeddings (left) and CLIP WordNet-relaxed text embeddings (right). In each set of trials, increasing the noise or WordNet ratio diminishes reliance on direct class tokens, pushing the teacher to depend more on general visual modality features (i.e., CLIP image embeddings) and reducing deceptive "shortcuts". Consequently, teachers with 100% noise or 100% WordNet text produce the best student accuracy (see Table \ref{tab:TS_concat_clips__ablation_TS_noise}). Meanwhile, WordNet expansions preserve semantic consistency, enabling the teacher to leverage textual cues more effectively than pure noise, thereby improving crossmodal KD performance.
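
The "proportion" varied in Figures 2 and 3 can be made concrete with a small sketch. It assumes the split is done per training sample by a seeded shuffle, which the captions do not specify; `assign_text_sources` and its parameters are hypothetical names.

```python
import random


def assign_text_sources(num_samples, relaxed_proportion, seed=0):
    """Mark each training sample as using a 'relaxed' or an 'exact' text embedding.

    Exactly round(num_samples * relaxed_proportion) samples are marked
    'relaxed'; which ones is decided by a seeded shuffle (an assumption).
    """
    if not 0.0 <= relaxed_proportion <= 1.0:
        raise ValueError("relaxed_proportion must lie in [0, 1]")
    num_relaxed = round(num_samples * relaxed_proportion)
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    relaxed = set(indices[:num_relaxed])
    return ["relaxed" if i in relaxed else "exact" for i in range(num_samples)]


sources = assign_text_sources(num_samples=10, relaxed_proportion=0.3)
print(sources.count("relaxed"), sources.count("exact"))  # 3 7
```

Setting `relaxed_proportion` to 0.0 or 1.0 reproduces the two ends of the sweep: every sample paired with a class-name-based embedding, or every sample paired with a relaxed one.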