Choosing Wisely and Learning Deeply: Selective Cross-Modality Distillation via CLIP for Domain Generalization

Jixuan Leng, Yijiang Li, Haohan Wang

TL;DR

This work tackles domain generalization by introducing Selective Cross-Modality Distillation (SCMD), which distills from CLIP into a single-modal student using a hard-to-learn sample selector and a cross-modality projection that aligns student features with CLIP's text embeddings. The method combines supervised learning, a logits-KL distillation term, and a CLIP-guided cross-modal loss, with a theoretical bound showing that focusing on high-loss samples tightens robustness to distribution shift. Empirically, SCMD achieves state-of-the-art results on the DomainBed DG benchmarks, and ablations confirm that both the hard-sample selection and the cross-modality module contribute meaningfully to performance. The findings suggest that selective, cross-modal knowledge transfer is a practical and effective strategy for improving domain generalization in vision tasks.
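The selection-plus-distillation recipe summarized above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the weight `lam`, the choice to apply the supervised term to the whole batch, the KL direction (teacher to student), and the reading of `k` as "number of hardest samples kept" and `tau` as a softmax temperature are all assumptions made for the example. The CLIP-guided cross-modal term is left out here.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Per-sample cross-entropy of a prediction against an integer label."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_hard_samples(student_logits, labels, k):
    """Indices of the k samples on which the student's loss is highest."""
    losses = [cross_entropy(z, y) for z, y in zip(student_logits, labels)]
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:k]

def selective_distillation_loss(student_logits, teacher_logits, labels,
                                k, tau=2.0, lam=1.0):
    """Supervised loss on the batch plus KL distillation on the hard subset.

    Assumption: only the distillation term is restricted to selected samples;
    `lam` is a hypothetical weighting hyperparameter.
    """
    n = len(labels)
    supervised = sum(cross_entropy(z, y)
                     for z, y in zip(student_logits, labels)) / n
    hard = select_hard_samples(student_logits, labels, k)
    distill = sum(
        kl_divergence(softmax(teacher_logits[i], tau),
                      softmax(student_logits[i], tau))
        for i in hard
    ) / max(len(hard), 1)
    return supervised + lam * distill

# Toy batch: 3 samples, 2 classes, all with true label 0.
student = [[4.0, 0.0], [0.0, 4.0], [0.5, 0.0]]
teacher = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
labels = [0, 0, 0]
hard_indices = select_hard_samples(student, labels, k=2)
loss = selective_distillation_loss(student, teacher, labels, k=2)
```

In the toy batch the second sample is confidently wrong and the third is only weakly right, so those two are the ones passed to the distillation term, while the confidently correct first sample is skipped.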

Abstract

Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically CLIP, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains. Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation. In parallel, we introduce a novel cross-modality module that seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions. We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver state-of-the-art performance, surpassing existing domain generalization methods. Furthermore, we provide a theoretical analysis of our selection strategy, offering deeper insight into its effectiveness and potential in the field of DG.
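One way to picture the cross-modality module described in the abstract is the sketch below: a linear layer maps the student's feature vector into the text-embedding space, and the resulting similarity distribution over class text embeddings is compared with a reference distribution. The abstract does not say what the student's distribution is aligned against; using CLIP's own image embedding as the reference, a KL divergence as the alignment measure, and all names (`project`, `cross_modal_loss`, `weight`, `tau`) are assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(features, weight):
    """Linear projection; `weight` holds one row per output dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weight]

def similarity_distribution(embedding, text_embeddings, tau=1.0):
    """Softmax over cosine similarities to each class text embedding."""
    e = normalize(embedding)
    sims = [sum(a * b for a, b in zip(e, normalize(t))) / tau
            for t in text_embeddings]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [x / total for x in exps]

def cross_modal_loss(student_features, weight, clip_image_embedding,
                     text_embeddings, tau=1.0):
    """KL between the reference and student similarity distributions."""
    p_ref = similarity_distribution(clip_image_embedding, text_embeddings, tau)
    p_student = similarity_distribution(project(student_features, weight),
                                        text_embeddings, tau)
    return sum(pr * math.log(pr / ps) for pr, ps in zip(p_ref, p_student))

# Toy setup: 3-d student features, 2-d shared embedding space, 2 classes.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0]]
student_features = [1.0, 0.0, 2.0]   # projects to [1.0, 2.0]

aligned = cross_modal_loss(student_features, weight, [1.0, 2.0], text_embeddings)
misaligned = cross_modal_loss(student_features, weight, [2.0, 1.0], text_embeddings)
```

The loss vanishes when the projected student feature induces the same similarity distribution as the reference embedding and grows as the two distributions diverge, which is the behavior a gradient-based trainer would exploit to learn `weight`.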

Paper Structure

This paper contains 35 sections, 6 theorems, 21 equations, 4 figures, 9 tables, 1 algorithm.

Key Result

Lemma 4.1

Assume A1: there is a gold-standard labeling function for the source and target domains. The lemma concerns two arbitrary distributions ${\bm{P}}'$ and ${\bm{P}}$, with $tv$ denoting the total variation.
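For readers unfamiliar with the term, a standard definition of the total variation between two distributions is given below. Whether the paper uses exactly this normalization, rather than the unhalved $L_1$ form that differs by a factor of 2, is an assumption; the set $A$ ranges over measurable events and $x$ over a discrete sample space.

```latex
tv({\bm{P}}', {\bm{P}})
  = \sup_{A} \bigl| {\bm{P}}'(A) - {\bm{P}}(A) \bigr|
  = \frac{1}{2} \sum_{x} \bigl| {\bm{P}}'(x) - {\bm{P}}(x) \bigr|
```

The second equality holds for discrete distributions; the value lies in $[0, 1]$ and equals $0$ only when the two distributions coincide.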

Figures (4)

  • Figure 1: Overview of SCMD, which features a selection mechanism that focuses on hard-to-learn samples and a cross-modality module that projects the student's features into CLIP's multi-modal space for alignment.
  • Figure 2: Evaluation of SCMD's performance across various student and CLIP model architectures on the PACS dataset
  • Figure 3: Sensitivity analysis for $\tau$ and $k$
  • Figure 4: Average step time per domain for different algorithms on PACS

Theorems & Definitions (9)

  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof