Domain Adaptation with a Single Vision-Language Embedding

Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

TL;DR

This paper presents a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data, and proposes prompt/photo-driven instance normalization (PIN), a feature augmentation method that mines multiple visual styles using a single target VL latent embedding.

Abstract

Domain adaptation has been extensively investigated in computer vision but still requires access to target data at the training time, which might be difficult to obtain in some uncommon conditions. In this paper, we present a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data. First, leveraging a contrastive language-image pre-training model (CLIP), we propose prompt/photo-driven instance normalization (PIN). PIN is a feature augmentation method that mines multiple visual styles using a single target VL latent embedding, by optimizing affine transformations of low-level source features. The VL embedding can come from a language prompt describing the target domain, a partially optimized language prompt, or a single unlabeled target image. Second, we show that these mined styles (i.e., augmentations) can be used for zero-shot (i.e., target-free) and one-shot unsupervised domain adaptation. Experiments on semantic segmentation demonstrate the effectiveness of the proposed method, which outperforms relevant baselines in the zero-shot and one-shot settings.

Paper Structure

This paper contains 19 sections, 12 equations, 10 figures, 13 tables, 2 algorithms.

Figures (10)

  • Figure 1: Domain adaptation with a single VL embedding. The proposed framework enables the adaptation of a segmenter model (here, DeepLabv3+ trained on the source dataset Cityscapes) to unseen conditions with only one embedding vector in shared VL space. (top) PØDA leverages a single text prompt. (bottom-left) PØDA-concept utilizes a prompt where the concept $\mathsf{S}^{*}{}$ is optimized from the source images and conditions remain textually described. (bottom-right) PIDA adapts the model using a single unlabeled target image. Source-only predictions are shown as smaller segmentation masks to the left or right of the test images.
  • Figure 2: Overview of our framework for domain adaptation using a single VL embedding. (Left) Using only a single VL embedding representing a target domain (i.e., target prompt + or target image $\blacktriangle$), we leverage a frozen ResNet encoder with CLIP weights to optimize $\text{source}{\shortrightarrow}\text{target}$ low-level feature affine transformations saved in a style bank. (Middle) Zero-shot/One-shot unsupervised domain adaptation is achieved by fine-tuning a segmenter model ($M$) on features that are augmented using the learned transformations, here $\color{darkblue}\textbf{f}_{\text{s}\shortrightarrow\text{night}}$. (Right) This enables inference on target domains.
  • Figure 3: Concept optimization. A <concept> is optimized in the word embedding space by means of cosine distance $\color{darkred}\mathcal{L}_{\texttt{<concept>}}$, such that the text embedding gets closer to source image embeddings. The final value of this optimizable word embedding is denoted $\mathsf{S}^*$.
  • Figure 4: Target style mining from a source image. We illustrate here the optimization loop of the style-mining algorithm. The source image is forwarded through the CLIP image encoder $E_\text{img}$ to extract low-level features $\color{srccolor} {\mathbf{f}}_\text{s}$ and subsequent CLIP embedding $\color{srccolor} {\bar{\mathbf{f}}}_\text{s}$. At each optimization step $i$, $\texttt{augment}(\cdot)$ takes the style of the previous iteration, $\color{intermcolor}(\boldsymbol{\mu}^{i-1}\!\!,\boldsymbol{\sigma}^{i-1})$, and injects it within $\color{srccolor} {\mathbf{f}}_\text{s}$ via the PIN layer, to synthesize $\color{intermcolor}\textbf{f}^i_{\text{s}\shortrightarrow\text{t}}$ and the corresponding embedding $\color{intermcolor}\bar{\textbf{f}}^i_{\text{s}\shortrightarrow\text{t}}$. The loss $\color{darkred}\mathcal{L}_{\boldsymbol{\mu}, \boldsymbol{\sigma}}$ is the cosine distance between $\color{intermcolor}\bar{\textbf{f}}^i_{\text{s}\shortrightarrow\text{t}}$ and the target embedding ${\bar{\mathbf{f}}}_\text{t}$, which can be derived from a prompt (e.g., "driving at night"), a partially optimized prompt (e.g., "$\mathsf{S}^*$ at night"), or an unlabeled target image. Its optimization via gradient descent updates the style to $\color{intermcolor}(\boldsymbol{\mu}^{i}\!,\boldsymbol{\sigma}^{i})$.
  • Figure 5: CLIPstyler (Kwon and Ye, 2022) stylization. A sample Cityscapes image stylized using ad hoc target prompts. Translated images exhibit visible artifacts that may harm adaptation, e.g., for rain in the main results table.
  • ...and 5 more figures
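
The style-mining loop described in the Figure 4 caption can be illustrated with a small, dependency-free sketch. This is a toy under stated assumptions, not the authors' implementation: the PIN layer is written in an AdaIN-like form (normalize each channel of the source features, then rescale by $\boldsymbol{\sigma}$ and shift by $\boldsymbol{\mu}$), the style is assumed to start from the source statistics, and `toy_tail` is a hypothetical stand-in (ReLU, average pooling, random linear map) for the frozen remainder of the CLIP image encoder. All function names, sizes, and the synthetic target embedding are invented for illustration.

```python
import math
import random

def channel_stats(feat):
    """Per-channel mean and std of a feature map stored as [C][N] lists."""
    mus, sigmas = [], []
    for ch in feat:
        m = sum(ch) / len(ch)
        var = sum((x - m) ** 2 for x in ch) / len(ch)
        mus.append(m)
        sigmas.append(math.sqrt(var + 1e-5))
    return mus, sigmas

def normalize(feat):
    """Strip the source style: zero mean and near-unit std per channel."""
    mus, sigmas = channel_stats(feat)
    return [[(x - m) / s for x in ch] for ch, m, s in zip(feat, mus, sigmas)]

def pin(z, mu, sigma):
    """Assumed PIN form: inject a style into normalized features, per channel."""
    return [[s * x + m for x in ch] for ch, m, s in zip(z, mu, sigma)]

def toy_tail(feat, W):
    """Hypothetical frozen encoder tail: ReLU, average pool, linear map W."""
    pooled = [sum(max(x, 0.0) for x in ch) / len(ch) for ch in feat]
    return [sum(w * p for w, p in zip(row, pooled)) for row in W]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_distance(a, b):
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def mine_style(f_s, target_emb, W, steps=300, lr=0.02):
    """Gradient descent on (mu, sigma) to pull the embedding toward the target."""
    z = normalize(f_s)
    mu, sigma = channel_stats(f_s)  # assumption: start from the source style
    losses = []
    for _ in range(steps):
        feat = pin(z, mu, sigma)
        emb = toy_tail(feat, W)
        losses.append(cosine_distance(emb, target_emb))
        # Analytic gradient of the cosine distance w.r.t. the embedding.
        ne, nt = norm(emb), norm(target_emb)
        dot = sum(e * t for e, t in zip(emb, target_emb))
        g_emb = [-(t / (ne * nt) - dot * e / (ne ** 3 * nt))
                 for e, t in zip(emb, target_emb)]
        for c in range(len(mu)):
            # Back-propagate through the linear map, pooling, and ReLU.
            g_pool = sum(g_emb[k] * W[k][c] for k in range(len(W)))
            active = [1.0 if x > 0 else 0.0 for x in feat[c]]
            g_mu = g_pool * sum(active) / len(active)
            g_sigma = g_pool * sum(a * zc for a, zc in zip(active, z[c])) / len(active)
            mu[c] -= lr * g_mu
            sigma[c] -= lr * g_sigma
    losses.append(cosine_distance(toy_tail(pin(z, mu, sigma), W), target_emb))
    return mu, sigma, losses

# Toy data: C channels, N spatial positions, D embedding dimensions.
random.seed(0)
C, N, D = 4, 16, 3
f_s = [[random.gauss(0.5, 1.0) for _ in range(N)] for _ in range(C)]
W = [[random.gauss(0.0, 1.0) for _ in range(C)] for _ in range(D)]

# Synthetic "target" embedding: the toy embedding of a shifted, rescaled style.
mu_s, sigma_s = channel_stats(f_s)
shifted_mu = [m + d for m, d in zip(mu_s, [1.0, -1.0, 0.5, -0.5])]
scaled_sigma = [1.5 * s for s in sigma_s]
target_emb = toy_tail(pin(normalize(f_s), shifted_mu, scaled_sigma), W)

mu_t, sigma_t, losses = mine_style(f_s, target_emb, W)
print("cosine distance before/after mining:", losses[0], losses[-1])
```

In the framework as described, one such $(\boldsymbol{\mu}, \boldsymbol{\sigma})$ pair would be mined per source image and stored in the style bank; fine-tuning the segmenter then applies styles from this bank to source features. The sketch keeps only the per-image optimization and replaces automatic differentiation with a hand-derived gradient so that it runs without external libraries.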