Diffusion Features to Bridge Domain Gap for Semantic Segmentation

Yuxiang Ji, Boyong He, Chenyuan Qu, Zhuoyue Tan, Chuan Qin, Liaoni Wu

TL;DR

This paper proposes DIffusion Feature Fusion (DIFF), a backbone that extracts and integrates effective semantic representations through the diffusion process, and introduces a new training framework designed to implicitly learn posterior knowledge from it.

Abstract

Pre-trained diffusion models have demonstrated remarkable proficiency in synthesizing images across a wide range of scenarios with customizable prompts, indicating their capacity to capture universal features. Motivated by this, our study explores how the implicit knowledge embedded within diffusion models can address challenges in cross-domain semantic segmentation. This paper investigates an approach that leverages sampling and fusion techniques to harness the features of diffusion models efficiently. We propose DIffusion Feature Fusion (DIFF) as a backbone for extracting and integrating effective semantic representations through the diffusion process. By leveraging the strength of text-to-image generation, we introduce a new training framework designed to implicitly learn posterior knowledge from it. Through rigorous evaluation on domain generalization semantic segmentation, we establish that our method surpasses preceding approaches in mitigating discrepancies across distinct domains and attains state-of-the-art (SOTA) performance.

Paper Structure (11 sections, 5 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: Overview of proposed diffusion feature fusion (DIFF) module and implicit posterior knowledge learning (IPKL) training pipeline. (a) In the conditional branch, we extract the categories and reference masks from the semantic segmentation annotations and use them as conditions, which are input into the DIFF module along with the input image. Through the DIFF module, we obtain features enhanced with conditional information for supervised training. (b) In the unconditional branch, we only use the image as input to the DIFF module and employ the prediction results from the conditional branch as a teacher for consistency learning.
  • Figure 2: Segmentation predictions on unseen data from existing SOTA domain generalization (DG) semantic segmentation methods (DAFormer [Hoyer et al., 2022], ReVT [Termöhlen et al., 2023], CMFormer [Bi et al., 2023]) and our method.
  • Figure 3: Heatmap (on sidewalk) and segmentation results comparison between (b) without IPKL; (c) with IPKL but without reference input; (d) with IPKL and with reference input.
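
The two-branch training described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the conditional branch is trained with per-pixel cross-entropy against the labels, and that the consistency term is a KL divergence from the conditional (teacher) prediction to the unconditional (student) prediction. The summary does not state the actual loss functions or weighting, so the names `two_branch_loss` and `consistency_weight` are hypothetical.

```python
import math

def softmax(logits):
    """Convert one pixel's class logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Supervised loss for one pixel: -log p(label)."""
    return -math.log(softmax(logits)[label])

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one pixel (assumed consistency measure)."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher_probs, student_probs))

def two_branch_loss(cond_logits, uncond_logits, labels, consistency_weight=1.0):
    """Combine a supervised term and a teacher-student consistency term.

    cond_logits:   per-pixel logits from the conditional branch (teacher)
    uncond_logits: per-pixel logits from the unconditional branch (student)
    labels:        per-pixel ground-truth class indices
    consistency_weight is a hypothetical hyperparameter.
    """
    n = len(labels)
    # Conditional branch: supervised by the segmentation annotations.
    supervised = sum(cross_entropy(l, y)
                     for l, y in zip(cond_logits, labels)) / n
    # Unconditional branch: pulled toward the teacher's prediction.
    # In a real framework the teacher output would be detached from the graph.
    consistency = sum(kl_divergence(softmax(t), softmax(s))
                      for t, s in zip(cond_logits, uncond_logits)) / n
    total = supervised + consistency_weight * consistency
    return total, supervised, consistency

# Two pixels, two classes: the student disagrees with the teacher on pixel 0.
cond = [[2.0, 0.0], [0.0, 3.0]]
uncond = [[0.0, 2.0], [0.0, 3.0]]
total, sup, cons = two_branch_loss(cond, uncond, labels=[0, 1])
print(f"supervised={sup:.4f} consistency={cons:.4f} total={total:.4f}")
```

When the two branches produce identical predictions the consistency term vanishes, so the student is only penalized where it departs from the condition-enhanced teacher.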