LangDA: Building Context-Awareness via Language for Domain Adaptive Semantic Segmentation
Chang Liu, Bavesh Balaji, Saad Hossain, C Thomas, Kwei-Herng Lai, Raviteja Vemulapalli, Alexander Wong, Sirisha Rambhatla
TL;DR
LangDA addresses unsupervised domain adaptation for semantic segmentation by introducing context-aware language guidance. It combines a context-aware caption generator with an image-level consistency alignment: spatial relationships between objects are encoded in language and aligned with visual features in a CLIP-based latent space, all within an EMA-based teacher-student framework. The approach achieves state-of-the-art results on three benchmarks (Synthia→Cityscapes, Cityscapes→ACDC, Cityscapes→DarkZurich), outperforming prior methods by 2.6%, 1.4%, and 3.9%, and is robust to hyperparameter choices. The work shows that language-derived context can significantly improve dense prediction under domain shift, offering a practical, prompt-free route to domain-invariant representations for semantic segmentation.
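For readers unfamiliar with EMA-based teacher-student training, the following is a minimal sketch of the generic exponential-moving-average weight update that such frameworks typically use. It is not taken from the LangDA code: the function name `ema_update`, the dictionary representation of parameters, and the decay value are all illustrative assumptions.

```python
def ema_update(teacher, student, decay=0.999):
    """Generic EMA step: new_teacher = decay * teacher + (1 - decay) * student.

    `teacher` and `student` are hypothetical dicts mapping parameter names
    to floats; a real model would hold tensors instead.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy usage: the teacher drifts slowly toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is the usual reason this scheme is used to stabilize pseudo-labels on the unlabeled target domain.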
Abstract
Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches that use masking or multi-resolution crops, and (2) language-based approaches that use generic class-wise prompts informed by the target domain (e.g., "a {snowy} photo of a {class}"). However, the former is susceptible to noisy pseudo-labels that are biased toward the source domain, and the latter does not fully capture the intricate spatial relationships between objects, which are key for dense prediction tasks. To this end, we propose LangDA. First, LangDA learns contextual relationships between objects from VLM-generated scene descriptions (e.g., "a pedestrian is on the sidewalk, and the street is lined with buildings."). Second, LangDA aligns the features of the entire image with the text representation of this context-aware scene caption, and thereby learns generalized representations via text. With this, LangDA sets a new state of the art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4%, and 3.9%.
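The image-level alignment described above can be illustrated with a small sketch. The abstract does not specify the exact loss, so this assumes a simple cosine-distance objective between one pooled image embedding and one caption embedding; the function names and the toy vectors are hypothetical stand-ins for the outputs of a CLIP-style image and text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_alignment_loss(image_embedding, caption_embedding):
    """Assumed loss: 1 - cosine similarity (0 when the directions match)."""
    if len(image_embedding) != len(caption_embedding):
        raise ValueError("embeddings must have the same dimension")
    img = l2_normalize(image_embedding)
    txt = l2_normalize(caption_embedding)
    cosine = sum(a * b for a, b in zip(img, txt))
    return 1.0 - cosine


# Toy embeddings standing in for encoder outputs (illustrative values only).
image_embedding = [0.2, 0.9, 0.1]
caption_embedding = [0.25, 0.8, 0.05]
loss = cosine_alignment_loss(image_embedding, caption_embedding)
```

Minimizing such a loss pulls the whole-image representation toward the caption's representation, so the scene-level context expressed in text acts as a supervisory signal that does not depend on target-domain labels.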
