Source-Free Domain Adaptation with Frozen Multimodal Foundation Model

Song Tang, Wenxin Su, Mao Ye, Xiatian Zhu

TL;DR

This work tackles Source-Free Domain Adaptation by leveraging off-the-shelf vision-language foundation models (e.g., CLIP) to bridge the gap to unlabeled target data without access to source samples. The proposed DIFO framework alternates between task-specific customization of the ViL model through mutual-information-based prompt learning and memory-aware knowledge distillation to the target model, augmented by two regularizations: most-likely category encouragement and predictive consistency. Empirical results across Office-31, Office-Home, VisDA, and DomainNet-126 show that DIFO achieves state-of-the-art performance on closed-set SFDA and robust improvements in partial-set and open-set settings, outperforming CLIP-based zero-shot and prior SFDA methods. The approach demonstrates that enriching an SFDA pipeline with heterogeneous, task-tailored multimodal knowledge can significantly improve cross-domain generalization, supported by feature distribution analyses, MMD-based adaptation dynamics, and Grad-CAM visualizations.

Abstract

Source-Free Domain Adaptation (SFDA) aims to adapt a source model to a target domain, with access only to unlabeled target training data and the source model pre-trained on a supervised source domain. Relying on pseudo labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we explore, for the first time, the potential of off-the-shelf vision-language (ViL) multimodal models (e.g., CLIP) with rich yet heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, as it is largely generic rather than specialized for this particular task. To make it task-specific, we propose a novel Distilling multimodal Foundation model (DIFO) approach. Specifically, DIFO alternates between two steps during adaptation: (i) customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner, and (ii) distilling the knowledge of this customized ViL model to the target model. For more fine-grained and reliable distillation, we further introduce two effective regularization terms, namely most-likely category encouragement and predictive consistency. Extensive experiments show that DIFO significantly outperforms the state-of-the-art alternatives. Code is here
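Step (i) above relies on mutual information between the predictions of the ViL model and the target model. The abstract does not spell out the estimator, so the following is only a minimal sketch under stated assumptions: the joint distribution over the two models' class predictions is estimated as the batch average of outer products of their softmax outputs (a common choice, not confirmed as the authors' formulation). The function name `mutual_information` and its arguments are hypothetical.

```python
import math


def mutual_information(probs_a, probs_b, eps=1e-12):
    """Estimate I(A; B) in nats from two models' per-sample class distributions.

    probs_a, probs_b: lists of probability vectors, one per sample.
    Assumption (not from the paper): the joint distribution is the batch
    average of the outer products of the two prediction vectors.
    """
    n = len(probs_a)
    k_a, k_b = len(probs_a[0]), len(probs_b[0])

    # Joint distribution over (class predicted by A, class predicted by B).
    joint = [[0.0] * k_b for _ in range(k_a)]
    for pa, pb in zip(probs_a, probs_b):
        for i in range(k_a):
            for j in range(k_b):
                joint[i][j] += pa[i] * pb[j] / n

    # Marginals obtained by summing the joint over the other variable.
    marg_a = [sum(row) for row in joint]
    marg_b = [sum(joint[i][j] for i in range(k_a)) for j in range(k_b)]

    # I(A; B) = sum_ij p(i, j) * log( p(i, j) / (p(i) * p(j)) ).
    mi = 0.0
    for i in range(k_a):
        for j in range(k_b):
            p = joint[i][j]
            if p > eps:
                mi += p * math.log(p / (marg_a[i] * marg_b[j]))
    return mi


# Two models that always agree on a balanced two-class batch.
agree = [[1.0, 0.0], [0.0, 1.0]]
mi_agree = mutual_information(agree, agree)

# Two models whose predictions are statistically independent.
model_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
model_b = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mi_indep = mutual_information(model_a, model_b)
```

In a training loop, a quantity of this kind would be maximized with respect to the learnable prompt, which requires a differentiable implementation in an autodiff framework; the pure-Python version here is meant only to make the estimate concrete.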

Paper Structure (20 sections, 1 theorem, 11 equations, 12 figures, 8 tables, 1 algorithm)

Key Result

Theorem 1

Given two random variables $X$ and $Y$, their mutual information ${\rm{I}}\left( X, Y \right)$ and KL divergence $D_{\rm{KL}}\left( X||Y \right)$ satisfy the following inequality.
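
For reference, the standard definitions of the two quantities for discrete distributions are given below. They are textbook definitions rather than a restatement of the theorem's inequality, which is not reproduced in this listing. The symbols $p$, $P$, and $Q$ are notation assumed here, and reading $D_{\rm{KL}}\left( X||Y \right)$ as the divergence between the distributions of $X$ and $Y$ is likewise an assumption.

$$
{\rm{I}}\left( X, Y \right) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = D_{\rm{KL}}\left( p(x, y) \,||\, p(x)\, p(y) \right)
$$

$$
D_{\rm{KL}}\left( P||Q \right) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
$$

Both quantities are non-negative, and the mutual information is zero exactly when $X$ and $Y$ are independent.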

Figures (12)

  • Figure 1: We expand beyond traditional SFDA methods that rely solely on a pretrained source model and unlabeled target data. Instead, we innovate by exploring off-the-shelf multimodal foundation models, such as CLIP, in an unsupervised manner (marked by the box with blue background).
  • Figure 2: Overview of our DIFO: The process involves two alternating steps. First, we perform (a) task-specific customization of a ViL model through task-specific prompt learning ($L_{\rm{TSC}}$). This is achieved under soft predictive guidance using mutual information maximization. Second, we undertake (b) memory-aware knowledge adaptation, incorporating two regularizations: most-likely category encouragement ($L_{\rm{MCE}}$) predicted by our dynamic memory-aware predictor, along with the typical predictive consistency ($L_{\rm{PC}}$). These regularizations are designed to facilitate a coarse-to-fine adaptation.
  • Figure 3: Illustration of most-likely category encouragement. In contrast to the conventional approach that assigns equal importance to all categories (depicted by the gray line), our approach (represented by the black line) introduces additional supervision by incorporating extra knowledge about the two most likely categories.
  • Figure 4: The performance of the scheme directly weighting the source model and CLIP-B32. All results are normalized by corresponding DIFO-C-B32 accuracies for a clear view.
  • Figure 5: Feature distribution visualization comparison on transfer task Ar$\to$Cl in Office-Home. Oracle is trained on target domain Cl using the ground-truth labels. Different colors stand for different categories. Top: t-SNE feature distribution over 65 categories. Bottom: The corresponding 3D density charts. For ease of viewing, only the first 10 categories are used in this plot.
  • ...and 7 more figures
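
The Figure 4 caption refers to a baseline that directly weights the source model and CLIP-B32. The caption does not give the weighting rule, so the sketch below assumes the simplest reading: a convex combination of the two models' class-probability vectors. The function names `weighted_fusion` and `predict` and the mixing weight `alpha` are hypothetical.

```python
def weighted_fusion(source_probs, vil_probs, alpha=0.5):
    """Convex combination of two class-probability vectors.

    alpha is the weight on the source model and (1 - alpha) the weight on
    the ViL model. This rule is an assumption made for illustration; the
    caption does not specify how the two models are weighted.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(source_probs) != len(vil_probs):
        raise ValueError("both models must predict over the same classes")
    return [alpha * s + (1.0 - alpha) * v
            for s, v in zip(source_probs, vil_probs)]


def predict(probs):
    """Return the index of the highest-probability class."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical two-class outputs on which the models disagree.
source_out = [0.8, 0.2]
vil_out = [0.4, 0.6]

fused = weighted_fusion(source_out, vil_out, alpha=0.5)
label = predict(fused)
```

A fixed weight of this kind treats both models as equally reliable on every sample, in contrast to the alternating customization and distillation that DIFO performs.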

Theorems & Definitions (1)

  • Theorem 1