LanDA: Language-Guided Multi-Source Domain Adaptation
Zhenbin Wang, Lei Zhang, Lituan Wang, Minjuan Zhu
TL;DR
LanDA tackles multi-source domain adaptation without target-domain images by exploiting a visual-language foundation model and language-described target domains. It introduces domain-specific augmenters to map each source domain into extended domains, followed by a Wasserstein-distance-based alignment that incorporates inter-class text information to extract domain-invariant features. A linear classifier is trained on both original and augmented embeddings, and target predictions are formed by a weighted combination of extended-domain predictions guided by text-based domain similarities. The approach yields state-of-the-art results on MSDA benchmarks while using significantly fewer trainable parameters than traditional MSDA methods, highlighting the potential of language guidance and OT-based alignment in multimodal settings.
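As a rough illustration of the final prediction step described above, the sketch below combines per-extended-domain class probabilities using weights derived from text-embedding similarities. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `combine_predictions`, the use of cosine similarity followed by a softmax, and the `temperature` parameter are all hypothetical choices made for illustration, and the embeddings are assumed to come from the text encoder of some vision-language model.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def combine_predictions(domain_probs, domain_text_emb, target_text_emb,
                        temperature=0.1):
    """Weight per-domain predictions by text-based domain similarity.

    domain_probs:    (K, N, C) class probabilities from K extended domains
                     for N samples and C classes.
    domain_text_emb: (K, D) text embeddings describing each extended domain.
    target_text_emb: (D,) text embedding describing the target domain.
    temperature:     hypothetical sharpening parameter (an assumption).
    """
    domain_probs = np.asarray(domain_probs, dtype=float)
    d = np.asarray(domain_text_emb, dtype=float)
    t = np.asarray(target_text_emb, dtype=float)

    # Cosine similarity between each domain description and the target one.
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    t = t / np.linalg.norm(t)
    sims = d @ t                                   # shape (K,)

    # Turn similarities into weights that sum to 1 (assumed: softmax).
    weights = softmax(sims / temperature)          # shape (K,)

    # Weighted sum over the domain axis gives (N, C) target predictions.
    combined = np.tensordot(weights, domain_probs, axes=1)
    return combined, weights


if __name__ == "__main__":
    # Toy example: two extended domains, one sample, two classes.
    probs = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])
    domain_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
    target_emb = np.array([1.0, 0.0])
    combined, weights = combine_predictions(probs, domain_emb, target_emb)
    print("weights:", weights)
    print("combined:", combined)
```

In this toy setup the first extended domain is described by the same embedding as the target, so it receives most of the weight and dominates the combined prediction. Other weighting schemes (for example, normalized raw similarities) would fit the same interface.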
Abstract
Multi-Source Domain Adaptation (MSDA) aims to mitigate shifts in data distribution when transferring knowledge from multiple labeled source domains to an unlabeled target domain. Existing MSDA techniques, however, assume that target domain images are available and overlook image-rich semantic information. Consequently, an open question is whether MSDA can be guided solely by textual cues in the absence of target domain images. By employing a multimodal model with a joint image and language embedding space, we propose a novel language-guided MSDA approach, termed LanDA, based on optimal transport theory. LanDA facilitates the transfer of multiple source domains to a new target domain while retaining task-relevant information, requiring only a textual description of the target domain and not a single target domain image. We present extensive experiments across different transfer scenarios using a suite of relevant benchmarks, demonstrating that LanDA outperforms standard fine-tuning and ensemble approaches in both target and source domains.
