Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation

Seogkyu Jeon, Kibeom Hong, Hyeran Byun

TL;DR

Domain generalized semantic segmentation (DGSS) must cope with domain shifts that degrade cross-domain segmentation performance. DPMFormer addresses this with two components built on a Mask2Former architecture with a VLM backbone: domain-aware prompt learning, which injects properties of the input domain into the textual prompts, and domain-robust consistency learning, which enforces stable predictions under texture-based domain perturbations. Key contributions include a domain-aware contrastive loss that aligns textual and visual domain cues, texture perturbations that diversify the observable domains, and multi-layer consistency losses that prevent error propagation. The method achieves state-of-the-art performance on synthetic-to-real and real-to-real DGSS benchmarks, with improvements across multiple target domains and robust qualitative results on diverse styles.

Abstract

Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises from the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, the Domain-aware Prompt-driven Masked Transformer (DPMFormer). First, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. Second, to capture various domain-specific properties from a single source dataset, we propose domain-aware contrastive learning together with a texture perturbation that diversifies the observable domains. Lastly, to make the framework resilient to diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize the discrepancy between its predictions on the original and augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state of the art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.
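
To make the idea of consistency learning concrete, the following is a minimal sketch of a prediction-consistency term between an original image and its augmented counterpart. It is not the paper's implementation: the choice of a symmetric KL divergence, the function names (`softmax`, `kl_divergence`, `consistency_loss`), and the per-pixel list-of-logits layout are all assumptions made for illustration. The actual losses are defined in the paper and the linked repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_orig, logits_aug):
    """Mean symmetric KL between per-pixel class distributions.

    Both arguments are lists of per-pixel logit lists (hypothetical layout).
    The symmetric KL is an assumed stand-in for the paper's consistency loss.
    """
    total = 0.0
    for lo, la in zip(logits_orig, logits_aug):
        p, q = softmax(lo), softmax(la)
        total += 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return total / len(logits_orig)


# Two "pixels" with three classes each; the augmented view changes pixel 0.
orig = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
aug = [[0.5, 2.0, -1.0], [0.1, 0.2, 0.3]]
print(consistency_loss(orig, aug))
```

The term is zero when both views yield identical predictions and grows as they diverge, which is the behavior a consistency objective is meant to penalize.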

Paper Structure

This paper contains 28 sections, 6 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: Motivation of DPMFormer. Fixed context prompts [lin2023clip, pak2024textual] tend to retain source-domain properties, causing contextual misalignment with the target domain. In contrast, DPMFormer translates the domain properties of the input image into context prompts, enhancing semantic alignment.
  • Figure 2: PCA visualization of textual prompts (left) and qualitative results in various environments (e.g., Day, Dawn, Night) in BDD100K [yu2020bdd100k] (right). The models are trained on GTAV [richter2016playing] with the CLIP-pretrained backbone (ViT-B) [radford2021learning]. A fixed single-context prompt is too rigid to adapt to various domain shifts. In contrast, our framework uses domain-specific properties of the input images as context prompts, enhancing semantic alignment between text and images. As a result, as shown in the qualitative results (right), our approach exhibits improved robustness across diverse environments.
  • Figure 3: Illustration of DPMFormer. We use a Mask2Former-based architecture [cheng2022masked], which consists of a backbone image encoder ($ENC_{I}$), a pixel decoder ($DEC_{\text{pix}}$), a transformer decoder ($DEC_{\text{tr}}$), and a text encoder ($ENC_{T}$). During training, we synthesize images with a novel domain style via texture perturbation. The original and perturbed images are combined into one batch and used for learning domain-awareness (Sec. \ref{sec:domain_awareness}) and domain-robustness (Sec. \ref{sec:domain_robustness}).
  • Figure 4: Qualitative comparison in the synthetic-to-real scenario with the CLIP-pretrained backbone (ViT-B). The source domain for training is GTAV [richter2016playing], while the target domains are BDD100K [yu2020bdd100k] and Mapillary [neuhold2017mapillary]. Overall, DPMFormer produces precise segmentation on images with strong illumination contrast as well as confusing textures.
  • Figure 5: Qualitative results on diverse styles, i.e., Minimalist, Pop Art, Bauhaus, and Cubism. The models are trained on GTAV with the CLIP-pretrained backbone (ViT-B).
  • ...and 7 more figures