ViLAaD: Enhancing "Attracting and Dispersing" Source-Free Domain Adaptation with Vision-and-Language Model

Shuhei Tarashima, Xinqi Shu, Norio Tagawa

TL;DR

ViLAaD introduces a Vision-and-Language (ViL) prior into the Attracting and Dispersing (AaD) framework for Source-Free Domain Adaptation (SFDA). Using the ViL model as a strong initialization, ViLAaD aligns the target model's prediction for each target sample with the ViL model's predictions for that sample's neighbors, while still pushing its prediction away from those of non-neighbors. ViLAaD++ further improves performance by alternating target-model adaptation with ViL prompt tuning and by adding objective terms for the target model. Experiments on Office-31, Office-Home, VisDA-C, and DomainNet-126 show that ViLAaD outperforms both AaD and ViL zero-shot classification, while ViLAaD++ achieves state-of-the-art results across Closed-set, Partial-set, and Open-set SFDA settings. The work shows that a lightweight target model can exploit ViL priors without access to source data, and it highlights the complementary role of prompt tuning in cross-modal adaptation.

Abstract

Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to a target dataset from a different domain without access to the source data. Conventional SFDA methods are limited by the information encoded in the pre-trained source model and the unlabeled target data. Recently, approaches leveraging auxiliary resources have emerged, yet remain in their early stages, offering ample opportunities for research. In this work, we propose a novel method that incorporates auxiliary information by extending an existing SFDA framework using Vision-and-Language (ViL) models. Specifically, we build upon Attracting and Dispersing (AaD), a widely adopted SFDA technique, and generalize its core principle to naturally integrate ViL models as a powerful initialization for target adaptation. Our approach, called ViL-enhanced AaD (ViLAaD), preserves the simplicity and flexibility of the AaD framework, while leveraging ViL models to significantly boost adaptation performance. We validate our method through experiments using various ViL models, demonstrating that ViLAaD consistently outperforms both AaD and zero-shot classification by ViL models, especially when both the source model and ViL model provide strong initializations. Moreover, the flexibility of ViLAaD allows it to be seamlessly incorporated into an alternating optimization framework with ViL prompt tuning and extended with additional objectives for target model adaptation. Extensive experiments on four SFDA benchmarks show that this enhanced version, ViLAaD++, achieves state-of-the-art performance across multiple SFDA scenarios, including Closed-set SFDA, Partial-set SFDA, and Open-set SFDA.

Paper Structure

This paper contains 18 sections, 12 equations, 5 figures, 9 tables, 2 algorithms.

Figures (5)

  • Figure 1: Suppose we have a domain adaptation (DA) model and a target dataset. As illustrated in (a), for the $i$-th target example, the $j$-th and $k$-th examples are part of its closed neighbor set $\mathcal{C}_{i}$ in the feature space of the DA model, while the $l$-th example belongs to its complementary set $\bar{\mathcal{C}}_{i}$. AaD yang+2022neurips encourages the DA model to produce similar predictions for the $i$-th example and its neighbors in $\mathcal{C}_{i}$, while pushing apart the predictions for the $i$-th example and those in $\bar{\mathcal{C}}_{i}$ (as shown in (b)). In contrast, our proposed ViLAaD method enhances the alignment between the DA model’s prediction for the $i$-th example and the Vision-and-Language (ViL) model’s predictions for its neighbors in $\mathcal{C}_{i}$, while still enforcing dissimilarity with the examples in $\bar{\mathcal{C}}_{i}$ (see (c)).
  • Figure 2: ViLAaD / ViLAaD++ vs. baselines with different ViL models, i.e., CLIP-ViT-B/32 (C-B32) radford+2021arxiv, CLIP-ViT-L/14 (C-L14) radford+2021arxiv, ALBEF (AL) li+2021neurips, and BLIP (BL) li+2022icml, on the Office-31 (O31), Office-Home (OH), and VisDA-C (VisDA) datasets.
  • Figure 3: Confusion matrices for (a) the source model's predictions, (b) zero-shot classification (ZSC) using ALBEF li+2021neurips, and (c) ViLAaD with ALBEF on the VisDA-C dataset. For the "car" class, ZSC achieves an accuracy of 55.0%, which is lower than that of Source (67.9%). As a result, ViLAaD reaches 60.3% accuracy, which is below both Source and AaD (76.2%, see Table \ref{tab:eval:closed:visda}).
  • Figure 7: Accuracies of ViLAaD (left) and ViLAaD++ (right) with respect to the number of epochs in the AP scenario on the Office-Home dataset. C-B32 is used to generate the results. Note that in ViLAaD, the accuracy of zero-shot classification (ZSC) by the ViL model remains unchanged throughout training, as both the ViL model and its text prompt are kept frozen. In contrast, ViLAaD++ shows improved ZSC accuracy over time, as its text prompt is jointly tuned during adaptation.
  • Figure 8: t-SNE visualizations of the predicted probability distributions (i.e., $p$ or $q$) for (a) source model classification, (b) zero-shot classification (ZSC) using a ViL model without prompt tuning, (c) ViLAaD, (d) ZSC using a ViL model with prompt tuning, and (e) ViLAaD++ in the CP scenario on the Office-Home dataset. In all cases, C-B32 is used as the underlying ViL model. Different colors correspond to different classes.
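
The contrast drawn in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact ViLAaD objective does not appear in this excerpt, so the dot-product form of the attraction and dispersion terms, the cosine k-nearest-neighbor construction of $\mathcal{C}_{i}$, the weight `lam`, and all function names (`neighbor_mask`, `attract_disperse_loss`) are hypothetical. Here `p` stands for the DA model's predicted probabilities and `q` for the ViL model's.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def neighbor_mask(features, k):
    """Boolean (n, n) mask: mask[i, j] is True when j is among the k nearest
    neighbors of i by cosine similarity (i itself excluded).
    Assumption: this kNN rule stands in for the neighbor set C_i."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)              # never pick a sample as its own neighbor
    idx = np.argsort(-sim, axis=1)[:, :k]       # indices of the k most similar samples
    mask = np.zeros(sim.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def attract_disperse_loss(p, attract_targets, mask, lam=1.0):
    """Attract p_i toward attract_targets[j] for j in C_i, and disperse p_i
    from p_l for l outside C_i (l != i).
    Assumption: both terms are plain dot products, weighted by lam."""
    attract = ((p @ attract_targets.T) * mask).sum(axis=1)
    not_self = ~np.eye(len(p), dtype=bool)
    disperse = ((p @ p.T) * (~mask & not_self)).sum(axis=1)
    return float((-attract + lam * disperse).mean())

rng = np.random.default_rng(0)
n, d, c = 8, 16, 4
features = rng.normal(size=(n, d))              # DA-model features (random stand-ins)
p = softmax(rng.normal(size=(n, c)))            # DA-model predictions
q = softmax(rng.normal(size=(n, c)))            # ViL-model predictions
mask = neighbor_mask(features, k=3)

aad_style = attract_disperse_loss(p, p, mask)      # Figure 1(b): attract to neighbors' DA predictions
vilaad_style = attract_disperse_loss(p, q, mask)   # Figure 1(c): attract to neighbors' ViL predictions
```

The only difference between the two calls is the attraction target: passing `p` reproduces the AaD-style behavior described in panel (b), while passing `q` swaps in the ViL model's predictions as in panel (c). The dispersion term is the same in both.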