Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation

Zhekai Du, Xinyao Li, Fengling Li, Ke Lu, Lei Zhu, Jingjing Li

TL;DR

Domain-Agnostic Mutual Prompting (DAMP) exploits domain-invariant semantics by mutually aligning visual and textual embeddings.

Abstract

Conventional Unsupervised Domain Adaptation (UDA) strives to minimize distribution discrepancy between domains, which neglects to harness rich semantics from data and struggles to handle complex domain shifts. A promising technique is to leverage the knowledge of large-scale pre-trained vision-language models for more guided adaptation. Despite some endeavors, current methods often learn textual prompts to embed domain semantics for source and target domains separately and perform classification within each domain, limiting cross-domain knowledge transfer. Moreover, prompting only the language branch lacks flexibility to adapt both modalities dynamically. To bridge this gap, we propose Domain-Agnostic Mutual Prompting (DAMP) to exploit domain-invariant semantics by mutually aligning visual and textual embeddings. Specifically, the image contextual information is utilized to prompt the language branch in a domain-agnostic and instance-conditioned way. Meanwhile, visual prompts are imposed based on the domain-agnostic textual prompt to elicit domain-invariant visual embeddings. These two branches of prompts are learned mutually with a cross-attention module and regularized with a semantic-consistency loss and an instance-discrimination contrastive loss. Experiments on three UDA benchmarks demonstrate the superiority of DAMP over state-of-the-art approaches.
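
The abstract names two generic building blocks, cross-attention between the prompt branches and an instance-discrimination contrastive loss, without giving their exact form here. The following is a minimal NumPy sketch of those generic techniques only, not the authors' implementation. The function names (`cross_attention`, `mutual_prompt`, `info_nce`), the single-head attention without learned projections, the residual updates, and the temperature of 0.1 are all assumptions made for illustration. The semantic-consistency loss is left out because its form is not specified in this text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (an assumption; a real module would learn query/key/value maps).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ keys_values   # (n_q, d)

def mutual_prompt(text_prompts, image_tokens):
    # Hypothetical mutual update: each branch attends to the other and
    # adds the result as a residual, so both are conditioned on the pair.
    text_out = text_prompts + cross_attention(text_prompts, image_tokens)
    image_out = image_tokens + cross_attention(image_tokens, text_prompts)
    return text_out, image_out

def info_nce(view_a, view_b, temperature=0.1):
    # InfoNCE-style instance discrimination: row i of view_a should match
    # row i of view_b and mismatch every other row.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
text_prompts = rng.normal(size=(3, 8))   # 3 prompt tokens, dimension 8
image_tokens = rng.normal(size=(5, 8))   # 5 image context tokens
text_out, image_out = mutual_prompt(text_prompts, image_tokens)
print(text_out.shape, image_out.shape)

views = rng.normal(size=(4, 8))
print(info_nce(views, views) < info_nce(views, views[::-1]))
```

In this sketch the contrastive loss is lower when the two views are paired instance by instance than when the pairing is reversed, which is the behavior an instance-discrimination objective rewards.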

Paper Structure (20 sections, 22 equations, 7 figures, 9 tables, 1 algorithm)

Figures (7)

  • Figure 1: Top: existing prompt-based methods (e.g., DAPrompt ge2022domain) learn only textual prompts to embed semantics for each domain and perform classification separately, which limits cross-domain knowledge transfer and feature alignment. Bottom: our method learns textual and visual prompts mutually to make the embeddings of both modalities domain-invariant, enabling better use of source knowledge and more flexible alignment.
  • Figure 2: Overview of the proposed DAMP framework. Parameters of $f_s$ and $f_v$ are frozen and only $\bm{p}_{1:N}$ and $G$ are tunable during training. The blue arrows represent text data flows, while the green and purple arrows are data flows for source and target images, respectively. We depict the prompting process only for weakly augmented source samples; all other samples follow the same process. $\mathcal{L}_{sc}^{s}$ ($\mathcal{L}_{sc}^{t}$), $\mathcal{L}_{idc}^{s}$ ($\mathcal{L}_{idc}^{t}$), and $\mathcal{L}_{im}$ are regularizations to make the prompting domain-agnostic, instance-conditioned and semantic-compatible, respectively.
  • Figure 3: Visualization of (a) visual embeddings and (b) textual embeddings using t-SNE van2008visualizing on task Ar $\rightarrow$ Pr (Office-Home). Light and dark colors represent embeddings before and after our mutual prompting, respectively. Red and blue points are source and target samples, respectively. Orange stars denote the class-level domain-agnostic textual embeddings $\{\bm{s}_k\}_{k=1}^{K}$.
  • Figure 4: Comparison of different UDA methods in terms of tunable parameters and accuracy on VisDA-17 (ResNet-101). DAMP uses only 11.9% as many parameters as PADCLIP.
  • Figure 5: Hyperparameter analysis. (a) Performance under different learnable token lengths $N$ on the VisDA-17 dataset. (b) Values of learnable hyperparameters $\gamma_v$ and $\gamma_t$ during training on task Cl $\rightarrow$ Sk (Mini-DomainNet). (c) Influence of different choices of $T$ on the Office-Home dataset. (d) Parameter sensitivity of $\lambda_c$ and $\lambda_i$ on task Cl $\rightarrow$ Ar (Office-Home).
  • ...and 2 more figures