LanDA: Language-Guided Multi-Source Domain Adaptation

Zhenbin Wang, Lei Zhang, Lituan Wang, Minjuan Zhu

TL;DR

LanDA tackles multi-source domain adaptation without target-domain images by exploiting a visual-language foundation model and language-described target domains. It introduces domain-specific augmenters to map each source domain into extended domains, followed by a Wasserstein-distance-based alignment that incorporates inter-class text information to extract domain-invariant features. A linear classifier is trained on both original and augmented embeddings, and target predictions are formed by a weighted combination of extended-domain predictions guided by text-based domain similarities. The approach yields state-of-the-art results on MSDA benchmarks while using significantly fewer trainable parameters than traditional MSDA methods, highlighting the potential of language guidance and OT-based alignment in multimodal settings.
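
The final prediction step described above (a weighted combination of extended-domain predictions, with weights derived from text-based domain similarities) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the cosine-similarity-plus-softmax weighting, and the `temperature` parameter are assumptions made for illustration; the summary only states that the weights are guided by text-based domain similarities.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def domain_weights(target_text_emb, source_text_embs, temperature=0.1):
    # ASSUMPTION: weights = softmax of cosine similarities between the
    # target-domain text embedding (D,) and each source-domain text
    # embedding (K, D). The paper's exact weighting may differ.
    sims = l2_normalize(source_text_embs) @ l2_normalize(target_text_emb)
    return softmax(sims / temperature)

def combine_predictions(per_domain_logits, weights):
    # per_domain_logits: (K, B, C) classifier logits, one slice per
    # extended domain; weights: (K,). Returns (B, C) probabilities.
    probs = softmax(per_domain_logits, axis=-1)
    return np.einsum("k,kbc->bc", weights, probs)

# Toy usage with random stand-ins for text embeddings and logits.
rng = np.random.default_rng(0)
K, B, C, D = 3, 4, 5, 8
w = domain_weights(rng.normal(size=D), rng.normal(size=(K, D)))
combined = combine_predictions(rng.normal(size=(K, B, C)), w)
pred = combined.argmax(axis=1)  # predicted class index per sample
```

Because the weights sum to one and each per-domain prediction is a probability vector, the combined output is again a valid distribution over classes.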

Abstract

Multi-Source Domain Adaptation (MSDA) aims to mitigate changes in data distribution when transferring knowledge from multiple labeled source domains to an unlabeled target domain. However, existing MSDA techniques assume that target domain images are available, yet overlook image-rich semantic information. Consequently, an open question is whether MSDA can be guided solely by textual cues in the absence of target domain images. By employing a multimodal model with a joint image and language embedding space, we propose a novel language-guided MSDA approach, termed LanDA, based on optimal transport theory. LanDA facilitates the transfer of multiple source domains to a new target domain while retaining task-relevant information, requiring only a textual description of the target domain and not a single target domain image. We present extensive experiments across different transfer scenarios using a suite of relevant benchmarks, demonstrating that LanDA outperforms standard fine-tuning and ensemble approaches in both target and source domains.

Paper Structure (32 sections, 3 theorems, 21 equations, 8 figures, 10 tables)

Key Result

Theorem 1

Let $\epsilon_{T}(f_{\mathrm{aug}})$ and $\epsilon_{S}(f_{\mathrm{aug}})$ represent the target domain error and source domain error, respectively. $\{\hat{Q}_k\}_{k=1}^{N}$ denotes the associated empirical measure from $\{\bar{Q}_k\}_{k=1}^{N}$. The kernel function $\varphi(\cdot,\cdot)$ solely depe

Figures (8)

  • Figure 1: An overview of the proposed two-stage framework. The training dataset comprises images from multiple source domains and various categories. In the absence of any image samples from the target domain, our goal is to generalize our model to the unseen clipart domain. To achieve this, we employ VLFMs and learn domain-level knowledge from the text. Each domain-specific augmenter uses the domain-class alignment loss to align the image embeddings from the source domain to the unseen target domain while preserving their class information. To effectively leverage the knowledge from multiple source domains, we propose a cost matrix function that projects the extended domains and the text embeddings of the class names into the Wasserstein space to learn domain-invariant information.
  • Figure 2: t-SNE visualizations on the Office-Home dataset for the transfer task A,C,R→P. Panel (a) shows the CLIP image embeddings, with different colors indicating distinct classes. The remaining panels show the extended domains, where each color corresponds to the image embeddings of a different extended domain. Three approaches to training the domain-specific augmenters are compared: (b) using only $\mathcal{L}_{DA}$; (c) using $\mathcal{L}_{DA}$ and $\mathcal{L}_{CA}$; and (d) using $\mathcal{L}_{DA}$, $\mathcal{L}_{CA}$ and $\mathcal{L}_{DC}$.
  • Figure 3: Image generation and nearest-neighbor results of different methods in different scenarios. (a) shows the source domain images in the order real, painting, sketch, and clipart; they are transferred to the painting, sketch, clipart, and real domains, respectively. (b) and (d) display the target domain image generation results of the VQGAN+CLIP and Diffusion+CLIP methods. For both methods, the text prompts incorporate the target domain name and class name (e.g., a painting of a dog), and the image prompts (taken from (a)) are input to generate the "target style" images. (c) and (e) present the nearest neighbors of the generated images within the correct target domain in the CLIP image embedding space. (f) displays the nearest neighbor obtained from the LADS method within the correct target domain in its augmented embedding space. (g) shows the nearest neighbor obtained from our method within the correct target domain in the combined extended domain space. The distance to each nearest neighbor is also indicated.
  • Figure 4: Nearest neighbors for LanDA when ablating the losses. The labels below each image give their respective categories. Images outlined in red have been augmented into incorrect categories. Training with $\mathcal{L}_{DA}$ alone easily results in misclassification, transferring the image to the wrong category. Although combining $\mathcal{L}_{DA}$ and $\mathcal{L}_{CA}$ helps reduce classification errors, the result still contains class-independent information. Our full method specializes in learning domain-invariant features.
  • Figure 5: t-SNE visualizations on the Mini-DomainNet and Office-Home datasets.
  • ...and 3 more figures
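
The Figure 1 caption mentions aligning extended domains in the Wasserstein space through a cost matrix. As a rough illustration of that kind of alignment term, the sketch below computes an entropy-regularized optimal transport cost between two sets of embeddings using Sinkhorn iterations. This is a generic technique swapped in for illustration, not the paper's objective: the cosine cost, the uniform marginals, the `reg` and `n_iters` values, and all function names are assumptions. In particular, the paper's cost matrix also incorporates class-name text embeddings, which is not reproduced here.

```python
import numpy as np

def cosine_cost(x, y, eps=1e-12):
    # Pairwise cost 1 - cos(x_i, y_j) between rows of x (n, d) and y (m, d).
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    y = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)
    return 1.0 - x @ y.T

def sinkhorn(cost, reg=0.1, n_iters=200):
    # Entropy-regularized OT between uniform marginals (an assumption).
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)      # rescale rows toward marginal a
        v = b / (kernel.T @ u)    # rescale columns toward marginal b
    plan = u[:, None] * kernel * v[None, :]
    return plan, float(np.sum(plan * cost))

# Toy usage: random stand-ins for embeddings of two extended domains.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(6, 8))
emb_b = rng.normal(size=(5, 8))
plan, ot_cost = sinkhorn(cosine_cost(emb_a, emb_b))
```

In a training loop, a quantity like `ot_cost` would be minimized with respect to the parameters that produce the embeddings; that would require a differentiable framework rather than plain NumPy.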

Theorems & Definitions (5)

  • Theorem 1
  • Definition 1
  • Definition 2
  • Lemma 1
  • Theorem 2