ViLAaD: Enhancing "Attracting and Dispersing" Source-Free Domain Adaptation with Vision-and-Language Model

Shuhei Tarashima; Xinqi Shu; Norio Tagawa

TL;DR

ViLAaD brings a Vision-and-Language (ViL) prior into the Attracting and Dispersing (AaD) framework for Source-Free Domain Adaptation (SFDA). Using ViL-derived predictions as a strong initialization, ViLAaD aligns each target sample with its ViL neighbors while keeping it dispersed from distant samples; ViLAaD++ further improves performance by alternating adaptation with ViL prompt tuning and adding objective terms for the target model. Experiments on Office-31, Office-Home, VisDA-C, and DomainNet-126 show that ViLAaD outperforms both AaD and ViL zero-shot classification, while ViLAaD++ achieves state-of-the-art results across Closed-set, Partial-set, and Open-set SFDA settings, often with the best variants using specific ViL prompts. The work shows the practical benefit of deploying lightweight target models that exploit ViL priors without access to source data, and it highlights the complementary role of prompt tuning in cross-modal adaptation.

Abstract

Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to a target dataset from a different domain without access to the source data. Conventional SFDA methods are limited by the information encoded in the pre-trained source model and the unlabeled target data. Recently, approaches leveraging auxiliary resources have emerged, yet remain in their early stages, offering ample opportunities for research. In this work, we propose a novel method that incorporates auxiliary information by extending an existing SFDA framework using Vision-and-Language (ViL) models. Specifically, we build upon Attracting and Dispersing (AaD), a widely adopted SFDA technique, and generalize its core principle to naturally integrate ViL models as a powerful initialization for target adaptation. Our approach, called ViL-enhanced AaD (ViLAaD), preserves the simplicity and flexibility of the AaD framework, while leveraging ViL models to significantly boost adaptation performance. We validate our method through experiments using various ViL models, demonstrating that ViLAaD consistently outperforms both AaD and zero-shot classification by ViL models, especially when both the source model and ViL model provide strong initializations. Moreover, the flexibility of ViLAaD allows it to be seamlessly incorporated into an alternating optimization framework with ViL prompt tuning and extended with additional objectives for target model adaptation. Extensive experiments on four SFDA benchmarks show that this enhanced version, ViLAaD++, achieves state-of-the-art performance across multiple SFDA scenarios, including Closed-set SFDA, Partial-set SFDA, and Open-set SFDA.
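
The abstract does not spell out the objective, so the following is only a minimal sketch of the attract-and-disperse idea it builds on, not the authors' implementation. It assumes the commonly cited AaD form (reward prediction agreement with a sample's nearest neighbors, penalize agreement with everything else) and, as a further assumption, picks the neighbors in a feature space that stands in for ViL outputs. The names `softmax`, `aad_style_loss`, `neighbor_feats`, `k`, and `lam` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def aad_style_loss(target_probs, neighbor_feats, k=1, lam=1.0):
    """Attract each sample to its k nearest neighbors, disperse it from the rest.

    target_probs:   (n, c) softmax outputs of the target model.
    neighbor_feats: (n, d) features used to choose neighbors
                    (assumed here to come from a ViL model).
    """
    n = target_probs.shape[0]

    # Cosine similarity in the neighbor-defining feature space.
    f = neighbor_feats / np.linalg.norm(neighbor_feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    nn_idx = np.argsort(-sim, axis=1)[:, :k]  # (n, k) neighbor indices

    # Pairwise agreement between target-model predictions.
    agree = target_probs @ target_probs.T

    neighbor_mask = np.zeros((n, n), dtype=bool)
    neighbor_mask[np.arange(n)[:, None], nn_idx] = True
    other_mask = ~neighbor_mask
    np.fill_diagonal(other_mask, False)  # exclude self from the dispersion set

    attract = -(agree * neighbor_mask).sum(axis=1)
    disperse = lam * (agree * other_mask).sum(axis=1)
    return float((attract + disperse).mean())


# Toy data: two tight clusters in the neighbor feature space.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
consistent = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
inconsistent = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

loss_consistent = aad_style_loss(consistent, feats)
loss_inconsistent = aad_style_loss(inconsistent, feats)
```

On this toy data, predictions that agree within each neighbor cluster give a lower loss than predictions that split the clusters, which is the behavior the attract and disperse terms are meant to encourage.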
