LangDA: Building Context-Awareness via Language for Domain Adaptive Semantic Segmentation

Chang Liu, Bavesh Balaji, Saad Hossain, C Thomas, Kwei-Herng Lai, Raviteja Vemulapalli, Alexander Wong, Sirisha Rambhatla

TL;DR

LangDA addresses unsupervised domain adaptation for semantic segmentation by introducing context-aware language guidance. It combines a context-aware caption generator with image-level consistency alignment: captions encode the spatial relationships between objects, and image features are aligned with the caption embeddings in a CLIP-based latent space, all within an EMA-based teacher-student framework. The approach sets a new state of the art on three benchmarks (Synthia→Cityscapes, Cityscapes→ACDC, Cityscapes→DarkZurich), outperforming existing methods by 2.6%, 1.4%, and 3.9%, and is reported to be robust to hyperparameters. The work shows that language-derived context can improve dense prediction under domain shift, offering a practical pathway to better domain-invariant representations for semantic segmentation.

Abstract

Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches using masking or multi-resolution crops, and (2) language-based approaches that use generic class-wise prompts informed by the target domain (e.g. "a {snowy} photo of a {class}"). However, the former is susceptible to noisy pseudo-labels that are biased toward the source domain. The latter does not fully capture the intricate spatial relationships of objects -- key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM-generated scene descriptions (e.g. "a pedestrian is on the sidewalk, and the street is lined with buildings."). Second, LangDA aligns the features of the entire image with the text representation of this context-aware scene caption and learns generalized representations via text. With this, LangDA sets a new state of the art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4% and 3.9%.
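
The image-level alignment described in the abstract can be illustrated with a minimal sketch. It assumes a linear adapter and a cosine-distance objective; neither is confirmed by the text above, and the actual LangDA loss may differ. The names `project`, `alignment_loss`, and `adapter_weight` are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def project(features, weight):
    """Hypothetical linear adapter: maps pooled image features into the
    text-embedding space (one row of `weight` per output dimension)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


def alignment_loss(image_features, caption_embedding, adapter_weight):
    """Assumed objective: cosine distance between the projected image
    feature and the caption embedding (0 when they point the same way)."""
    projected = l2_normalize(project(image_features, adapter_weight))
    target = l2_normalize(caption_embedding)
    cosine = sum(p * t for p, t in zip(projected, target))
    return 1.0 - cosine


# Toy usage with an identity adapter in a 2-D embedding space.
identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss([1.0, 2.0], [2.0, 4.0], identity)     # same direction
orthogonal = alignment_loss([1.0, 0.0], [0.0, 1.0], identity)  # unrelated
```

In this toy setting the loss is close to zero when the image feature and the caption embedding point in the same direction and grows as they diverge. Minimizing an objective of this kind pulls whole-image features toward the caption's text representation.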

Paper Structure

This paper contains 21 sections, 6 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Synthia $\to$ Cityscapes: Progress of DASS over time. Improvements in UDA methods have plateaued in the last two years. Compared to MIC [hoyer2023mic], which learns spatial relationships in the visual domain only, and CoPT [mata2024CoPT], which employs generic language priors, our proposed LangDA uses contextual information from descriptive captions, achieving state-of-the-art performance.
  • Figure 2: (a) Vision-only UDA leverages an EMA-updated teacher-student framework with consistency losses to segment unlabeled target data. (b) CoPT uses LLM-generated class-wise text prompts and performs pixel-level alignment (aligning pixel features to the corresponding class prompts), without focusing on spatial relationships in language. It also requires additional supervisory text prompts for the target domain. (c) Our proposed method, LangDA, utilizes context-aware image captions and performs image-level alignment (aligning image features to the image captions) to facilitate context-aware, domain-invariant adaptation. Words providing context are highlighted in green.
  • Figure 3: LangDA Architecture. LangDA is a prompt-driven UDA framework that leverages contextual language descriptions to bridge domain gaps between labeled source images and unlabeled target images. LangDA includes two modules: context-aware caption generation and language-consistency alignment. Left: Context-aware generation is a two-step process. First, a captioning model generates captions that encode contextual relationships for the source image (e.g. "there is a sidewalk on one side of the street"). Then, the captions are improved by passing class names from the ground-truth labels into an LLM. Right: In the image-level consistency alignment module, an adapter (explained in \ref{method:prompt_adapt}) projects the image features from the trained network onto the same latent space as the text embeddings. The LangDA image encoder is trained from scratch because the CLIP image encoder performs poorly on semantic segmentation tasks.
  • Figure 4: VLM Caption Generation Module. We generate scene descriptions for source images using a VLM [liu2024llava]. Class names are acquired from the ground-truth labels $y_S^{(i)}$. We can see that the VLM provides contextual relationships, such as "street is lined with buildings" and "numerous people walking along the sidewalk".
  • Figure 5: LLM Caption Refinement Module. We summarize the generated captions with an LLM. We append a system-level prompt to inform the LLM of our semantic segmentation objective. We can see that the LLM preserves spatial relationships from the VLM captions, as in "Sidewalk has pedestrians and riders".
  • ...and 4 more figures
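
The EMA-updated teacher-student framework mentioned in the Figure 2(a) caption can be sketched as follows. This is a generic exponential-moving-average weight update, not code from the paper: the function name `ema_update`, the dictionary representation of the weights, and the default `alpha` are all assumptions made for illustration.

```python
def ema_update(teacher, student, alpha=0.999):
    """Return new teacher weights as an exponential moving average.

    `teacher` and `student` map parameter names to floats (a stand-in
    for real tensors). `alpha` close to 1 makes the teacher change
    slowly, which is what smooths its pseudo-labels over training.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Toy usage: the teacher drifts slowly toward a fixed student.
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)
# After three steps teacher["w"] is approximately 0.9 ** 3.
```

In a typical setup of this kind, the student is trained by gradient descent while the teacher receives only this averaged update and supplies pseudo-labels for the unlabeled target images.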