Adversarially Domain-adaptive Latent Diffusion for Unsupervised Semantic Segmentation

Jongmin Yu, Zhongtian Sun, Chen Bene Chi, Jinhong Yang, Shan Luo

TL;DR

This work introduces a semantic segmentation method based on latent diffusion models, termed Inter-Coder Connected Latent Diffusion (ICCLD), which employs an inter-coder connection to enhance contextual understanding and preserve fine details, while adversarial learning aligns latent feature distributions across domains during the latent diffusion process.

Abstract

Semantic segmentation requires extensive pixel-level annotation, motivating unsupervised domain adaptation (UDA) to transfer knowledge from labelled source domains to unlabelled or weakly labelled target domains. One of the most efficient strategies involves using synthetic datasets generated within controlled virtual environments, such as video games or traffic simulators, which can automatically generate pixel-level annotations. However, even when such datasets are available, learning a well-generalised representation that captures both domains remains challenging, owing to probabilistic and geometric discrepancies between the virtual world and real-world imagery. This work introduces a semantic segmentation method based on latent diffusion models, termed Inter-Coder Connected Latent Diffusion (ICCLD), alongside an unsupervised domain adaptation approach. The model employs an inter-coder connection to enhance contextual understanding and preserve fine details, while adversarial learning aligns latent feature distributions across domains during the latent diffusion process. Experiments on GTA5, Synthia, and Cityscapes demonstrate that ICCLD outperforms state-of-the-art UDA methods, achieving mIoU scores of 74.4 (GTA5$\rightarrow$Cityscapes) and 67.2 (Synthia$\rightarrow$Cityscapes).
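
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean intersection-over-union (mIoU) is commonly computed from predicted and ground-truth class ids. It is not taken from the paper: the function name `mean_iou`, the flat-list input format, and the `ignore_index=255` default (the usual Cityscapes convention for unlabelled pixels) are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Return mIoU in percent for two equally long sequences of class ids.

    Assumes every prediction lies in [0, num_classes); pixels whose
    ground-truth id equals ignore_index are skipped.
    """
    inter = [0] * num_classes  # per-class intersection counts
    union = [0] * num_classes  # per-class union counts
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Average only over classes that occur in the prediction or the target.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        return float("nan")
    return 100.0 * sum(ious) / len(ious)


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the toy call, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is roughly 58.3. Benchmark figures are usually obtained by accumulating the intersection and union counts over the whole validation set before dividing, rather than averaging per-image scores.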

Paper Structure

This paper contains 15 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Illustration of the workflow for the two-step domain adaptation process using the proposed Inter-Coder Connected Latent Diffusion (ICCLD) framework. In the first stage, domain adaptation is performed on the encoder $\mathcal{E}$ and decoder $\mathcal{D}$ of ICCLD using a segmentation-based approach. The ClassMix method olsson2021classmix is used for data augmentation, generating mixed images $x^{s+t}$ and labels $y^{s+t}$. In the second stage, adversarial learning is applied to the denoising UNet $\epsilon_{\theta}$ for further domain alignment. During both stages, the loss functions are primarily computed using the teacher model, while the student model is updated via an exponential moving average (EMA) of the teacher model parameters.
  • Figure 2: Architectural details of ICCLD. To predict a segmentation mask $\bar{y}^{t}$, ICCLD samples a noise vector $\bar{z}_{T}$ from a Gaussian distribution $\mathcal{N}(0,1)$ and extracts a latent vector $z^{t}_{0}$ from a given image $x^{t}$. $\bar{z}_{T}$ and $z^{t}_{0}$ are concatenated and fed to the denoising UNet $\epsilon_{\theta}$ to generate a latent feature vector $\bar{z}^{t}_{0}$. The denoising step is repeated $T$ times to obtain $\bar{z}^{t}_{0}$. The decoder $\mathcal{D}$ takes $\bar{z}^{t}_{0}$ as input and predicts $\bar{y}^{t}$.
  • Figure 3: Example images and labels of the (a) GTA-5 richter2016playing, (b) SYNTHIA ros2016synthia, and (c) Cityscapes cordts2016Cityscapes datasets.
  • Figure 4: Qualitative comparison of the UDA performance of ICCLD depending on the use of the inter-coder connection and of the additional DA process using adversarial learning. The qualitative results highlighted by green dotted boxes show that the inter-coder connection and the additional DA process improve segmentation quality by reducing false positives.
  • Figure 5: Qualitative comparison of the UDA performance of the proposed ICCLD with HRDA hoyer2022hrda, DAFormer hoyer2021daformer, and BAPA liu2021bapa. HRDA and DAFormer achieve the 2$^{\text{nd}}$ and 3$^{\text{rd}}$ ranked mIoU (see Table \ref{tab:uda_comp}). BAPA performs best on the Veget class in the Synthia $\rightarrow$ Cityscapes UDA setting. The visualisation shows that the proposed method produces more precise segmentation results than the other methods.
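
The Figure 1 caption mentions two generic building blocks, ClassMix-style mixing and an EMA parameter update. The sketch below illustrates both on toy data using only the Python standard library. It is not the authors' implementation: the function names, the nested-list image format, the paste direction (source classes pasted onto the target sample), the rounding rule for "half of the classes", and the `decay` value are all assumptions made for illustration.

```python
import random


def _blend(src, tgt, mask):
    """Take src where mask is True and tgt elsewhere (2-D lists)."""
    return [[s if m else t for s, t, m in zip(sr, tr, mr)]
            for sr, tr, mr in zip(src, tgt, mask)]


def classmix(src_img, src_lbl, tgt_img, tgt_lbl, rng=random):
    """Paste the pixels of about half of the source classes onto the target.

    Images and label maps are equally sized 2-D lists (a hypothetical toy
    format). Returns (mixed_img, mixed_lbl, mask), where mask is True at
    every position copied from the source sample.
    """
    classes = sorted({c for row in src_lbl for c in row})
    # Assumption: round up so that at least one class is always pasted.
    chosen = set(rng.sample(classes, (len(classes) + 1) // 2))
    mask = [[c in chosen for c in row] for row in src_lbl]
    return _blend(src_img, tgt_img, mask), _blend(src_lbl, tgt_lbl, mask), mask


def ema_update(avg_params, new_params, decay=0.99):
    """Generic EMA step: avg <- decay * avg + (1 - decay) * new.

    Which network holds the averaged weights is left to the caller.
    """
    return {k: decay * avg_params[k] + (1.0 - decay) * new_params[k]
            for k in avg_params}


if __name__ == "__main__":
    x_s, y_s = [[1, 1], [2, 2]], [[0, 0], [1, 1]]
    x_t, y_t = [[9, 9], [9, 9]], [[5, 5], [5, 5]]
    x_mix, y_mix, m = classmix(x_s, y_s, x_t, y_t, rng=random.Random(0))
    print(x_mix, y_mix)
    print(ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9))
```

In this sketch `x_mix` and `y_mix` play the role of $x^{s+t}$ and $y^{s+t}$. Because the target domain is unlabelled, `y_t` would presumably hold pseudo-labels in practice, and a real pipeline would operate on image tensors rather than nested lists.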