ZoDi: Zero-Shot Domain Adaptation with Diffusion-Based Image Transfer

Hiroki Azuma, Yusuke Matsui, Atsuto Maki

TL;DR

The paper tackles domain shift in semantic segmentation when target-domain images are unavailable. It introduces ZoDi, a two-stage framework that combines diffusion-based zero-shot image transfer with similarity-based model adaptation to learn domain-robust representations, leveraging layout-to-image diffusion with stochastic inversion guided by segmentation maps. Key contributions include the first zero-shot diffusion-based domain adaptation approach for segmentation, a backbone-agnostic design, and the ability to visualize generated target-domain images to estimate performance. Experiments on Cityscapes→ACDC/GTA5 across day-night, weather, and game-domain shifts show consistent gains over source-only baselines and competitive performance against CLIP-based and unsupervised DA methods, with ablations supporting the effectiveness of layout-aware transfer and feature-similarity training.

Abstract

Deep learning models achieve high accuracy in segmentation and other tasks, yet domain shift often degrades their performance, which can be critical in real-world scenarios where no target images are available. This paper proposes ZoDi, a zero-shot domain adaptation method based on diffusion models, which is two-fold by design: zero-shot image transfer and model adaptation. First, we use an off-the-shelf diffusion model to synthesize target-like images by transferring the domain of source images to the target domain. Here we specifically aim to maintain the layout and content by using layout-to-image diffusion models with stochastic inversion. Second, we train the model using both source images and synthesized images with the original segmentation maps while maximizing the feature similarity of images from the two domains to learn domain-robust representations. Through experiments we show the benefits of ZoDi in image segmentation over state-of-the-art methods. It is also more widely applicable than existing CLIP-based methods because it assumes no specific backbone or model, and it makes it possible to estimate the model's performance without target images by inspecting the generated images. Our implementation will be publicly available.
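
The model-adaptation step described above (a task loss on source and synthesized images, plus a term that rewards similar features across the two domains) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity, the `sim_weight` parameter, and all function names below are assumptions made for illustration, since the abstract only states that feature similarity is maximized.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one pixel: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def combined_loss(src_logits, syn_logits, labels, src_feat, syn_feat,
                  sim_weight=1.0):
    """Task loss on both domains plus a feature-similarity penalty.

    src_logits / syn_logits: per-pixel class logits for a source image and
    its synthesized target-like counterpart. Both share the same labels,
    because the transfer is meant to preserve the layout.
    src_feat / syn_feat: feature vectors of the two images.
    sim_weight: hypothetical weighting factor (an assumption).
    """
    n = len(labels)
    task_src = sum(cross_entropy(l, y) for l, y in zip(src_logits, labels)) / n
    task_syn = sum(cross_entropy(l, y) for l, y in zip(syn_logits, labels)) / n
    # Minimizing (1 - cosine similarity) maximizes feature similarity.
    sim_loss = 1.0 - cosine_similarity(src_feat, syn_feat)
    return task_src + task_syn + sim_weight * sim_loss
```

In this sketch, two images whose features point in the same direction incur no similarity penalty, while orthogonal features add `sim_weight` to the total, so minimizing the loss pulls the representations of an image and its transferred version together.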

Paper Structure (18 sections, 9 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: The proposed training scheme, ZoDi. It comprises zero-shot image transfer and model adaptation: image transfer changes the domain of the original images to a target domain, and a segmentation model is then trained on the transferred target-like images together with the original ones. See Fig. 2 for a more detailed sketch.
  • Figure 2: The architecture of ZoDi. It consists of two components: zero-shot image transfer and model adaptation. First, to change the domain of the original images, we design layout-to-image (L2I) diffusion models with stochastic inversion for zero-shot image transfer. We use the original segmentation maps and a target prompt, e.g. "driving in <domain>". We then train the model with two losses: a task loss and a similarity loss.
  • Figure 3: Examples of generated images. Top: four original images from Cityscapes. Bottom: images generated from each of the four by our zero-shot image transfer into five different domains.
  • Figure 4: Qualitative results of different methods on each dataset (ACDC or GTA5). Compared with models trained only on source images (column 3), ZoDi (rightmost) delivers segmentation closer to the ground truth.
  • Figure 5: Ablation studies for our zero-shot image transfer method. Our method, which combines ControlNet and Stochastic Inversion, is highly capable of changing the domain of the images, as can be seen in the rightmost column. In contrast, images generated without ControlNet, without Stochastic Inversion, or without both (columns 2, 3, and 4) can fail to preserve the original content.
  • ...and 3 more figures