Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Bo Li, Yang Tang, Pan Zhou

TL;DR

Modality Adaptation with text-to-image Diffusion Models (MADM) is proposed for the semantic segmentation task. It uses text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.

Abstract

Despite their success, unsupervised domain adaptation methods for semantic segmentation primarily focus on adaptation between image domains and do not utilize other abundant visual modalities such as depth, infrared, and event. This limitation hinders their performance and restricts their application in real-world multimodal scenarios. To address this issue, we propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task, which utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities. Specifically, MADM comprises two complementary components that tackle the major challenges. First, due to the large modality gap, using data from one modality to generate pseudo-labels for another modality suffers from a significant drop in accuracy. To address this, MADM designs diffusion-based pseudo-label generation, which adds latent noise to stabilize pseudo-labels and enhance label accuracy. Second, to overcome the limitations of low-resolution latent features in diffusion models, MADM introduces the label palette and latent regression, which converts one-hot encoded labels into RGB form via a palette and regresses them in the latent space, so that the pre-trained decoder can be used for up-sampling to obtain fine-grained features. Extensive experimental results demonstrate that MADM achieves state-of-the-art adaptation performance across various modality tasks, including adaptation from images to depth, infrared, and event modalities. We open-source our code and models at https://github.com/XiaRho/MADM.
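
The label palette step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the four-class `PALETTE`, the helper names `one_hot_to_rgb` and `rgb_to_class_ids`, and the nearest-colour decoding rule are all assumptions made for illustration, and the latent encoder/decoder of the diffusion model is left out entirely. The sketch only shows that one-hot labels can be written as an RGB image and read back even after a small perturbation.

```python
import numpy as np

# Hypothetical 4-class palette (assumption): the colours actually used by
# MADM are not stated in this section.
PALETTE = np.array([
    [128, 64, 128],
    [70, 70, 70],
    [107, 142, 35],
    [220, 20, 60],
], dtype=np.uint8)


def one_hot_to_rgb(one_hot, palette=PALETTE):
    """Map one-hot labels of shape (H, W, C) to an RGB image of shape (H, W, 3)."""
    class_ids = one_hot.argmax(axis=-1)   # (H, W) integer class ids
    return palette[class_ids]             # integer-array indexing -> (H, W, 3)


def rgb_to_class_ids(rgb, palette=PALETTE):
    """Recover class ids by choosing the nearest palette colour per pixel.

    Nearest-colour decoding is an assumption of this sketch.
    """
    diff = rgb[..., None, :].astype(np.float32) - palette[None, None].astype(np.float32)
    dist = (diff ** 2).sum(axis=-1)       # (H, W, C) squared colour distances
    return dist.argmin(axis=-1)           # (H, W)


# Toy 2x2 label map, one class per pixel.
labels = np.array([[0, 1], [2, 3]])
one_hot = np.eye(len(PALETTE), dtype=np.uint8)[labels]   # (2, 2, 4)

rgb = one_hot_to_rgb(one_hot)
# Simulate a slightly imperfect reconstruction of the RGB label image.
noisy = np.clip(rgb.astype(np.int16) + 5, 0, 255).astype(np.uint8)
recovered = rgb_to_class_ids(noisy)
```

Because the palette colours in this toy example are far apart, the small offset does not change which colour is nearest, so `recovered` matches `labels`. With many classes, how well the decoding tolerates reconstruction error would depend on how the palette colours are spaced.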

Paper Structure

This paper contains 21 sections, 5 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: (1) On the left, we leverage the multi-modality model ImageBind to quantify the similarity of images and modalities across datasets, i.e., GTA5-Synthetic, Dark Zurich-Nighttime, ACDC-Snow, DELIVER-Depth, FMB-Infrared, and DSEC-Event. Specifically, we randomly select 500 samples from each dataset, and compute the average cosine similarity of the output vectors within the dataset (right side of the text) and between the datasets (on the arrows). (2) On the right, we compare the quantitative results with the state-of-the-art (SoTA) method Rein on three different modalities.
  • Figure 2: Our framework is divided into three parts. (1) Self-Training: The supervised loss $\mathcal{L}_{s}$ in the source modality and the pseudo-labeled loss $\mathcal{L}_{t}$ in the target modality are used to train the network. (2) Diffusion-based Pseudo-Label Generation (DPLG): In the early stage of training, we add noise to the latent representation $z_t$ to stabilize the pseudo-label generation. (3) Label Palette and Latent Regression (LPLR): The one-hot encoded labels $y_s/\hat{y}_t$ are converted to RGB form by a palette and then encoded into the latent space to supervise the UNet output $o_{s/t}$.
  • Figure 3: We visualize the pseudo-labels for the event modality at iterations 1,250, 1,750, and 2,250. The introduction of DPLG effectively improves the quality of pseudo-labels.
  • Figure 4: Qualitative semantic segmentation results generated by the SoTA methods MIC and Rein, and by our proposed MADM, on three modalities.
  • Figure 5: At the 1,250th iteration, we present a visual analysis of diffusion step $k$ in DPLG.
  • ...and 3 more figures
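
The similarity measurement described in the Figure 1 caption (500 samples per dataset, average cosine similarity within and between datasets) can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the embeddings here are random stand-ins rather than ImageBind outputs, the 64-dimensional size is arbitrary, the function name `mean_cosine_similarity` is hypothetical, and excluding self-pairs from the within-dataset average is a choice of this sketch that the caption does not specify.

```python
import numpy as np


def mean_cosine_similarity(a, b, exclude_self=False):
    """Average pairwise cosine similarity between the rows of a and the rows of b.

    With exclude_self=True (meant for a compared against itself), the diagonal
    pairs, which are always 1, are left out. That choice is an assumption here.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T                         # (N_a, N_b) cosine similarities
    if exclude_self:
        mask = ~np.eye(sims.shape[0], dtype=bool)
        return float(sims[mask].mean())
    return float(sims.mean())


# Stand-in embeddings (assumption): two synthetic "datasets", each clustered
# around its own random centre. Real inputs would be model output vectors.
rng = np.random.default_rng(0)
center_a = rng.normal(size=64)
center_b = rng.normal(size=64)
emb_a = center_a + 0.1 * rng.normal(size=(500, 64))   # 500 samples, as in the caption
emb_b = center_b + 0.1 * rng.normal(size=(500, 64))

within_a = mean_cosine_similarity(emb_a, emb_a, exclude_self=True)
between_ab = mean_cosine_similarity(emb_a, emb_b)
```

In this toy setup `within_a` is higher than `between_ab`, because each synthetic dataset is tightly clustered around its own centre. The actual values in Figure 1 depend on the real embeddings and cannot be inferred from this sketch.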