DynAlign: Unsupervised Dynamic Taxonomy Alignment for Cross-Domain Segmentation

Han Sun, Rui Gong, Ismail Nejjar, Olga Fink

TL;DR

DynAlign tackles cross-domain semantic segmentation when source and target taxonomies differ, by combining domain-specific UDA with foundation-model priors. It employs a three-stage pipeline: domain knowledge to align image-level shifts, semantic taxonomy mapping via language models to bridge label-level gaps, and visual priors (SAM+CLIP) with a fusion mechanism to reassign labels and generate pseudo-labels. The approach achieves state-of-the-art results on GTA→Mapillary and GTA→IDD, including improved handling of unseen classes, and supports unsupervised adaptation to evolving taxonomies. The work highlights the practical potential of integrating domain knowledge with open-world priors for robust, annotation-free cross-domain semantic segmentation.

Abstract

Current unsupervised domain adaptation (UDA) methods for semantic segmentation typically assume identical class labels between the source and target domains. This assumption ignores the label-level domain gap, which is common in real-world scenarios, and limits the ability of these methods to identify finer-grained or novel categories without extensive manual annotation. A promising direction to address this limitation lies in recent advances in foundation models, which exhibit strong generalization abilities due to their rich prior knowledge. However, these models often struggle with domain-specific nuances and underrepresented fine-grained categories. To address these challenges, we introduce DynAlign, a framework that integrates UDA with foundation models to bridge both the image-level and label-level domain gaps. Our approach leverages prior semantic knowledge to align source categories with target categories that can be novel, more fine-grained, or named differently (e.g., vehicle to {car, truck, bus}). Foundation models are then employed for precise segmentation and category reassignment. To further enhance accuracy, we propose a knowledge fusion approach that dynamically adapts to varying scene contexts. DynAlign generates accurate predictions in a new target label space without requiring any manual annotations, allowing seamless adaptation to new taxonomies through either model retraining or direct inference. Experiments on the street scene semantic segmentation benchmarks GTA to Mapillary Vistas and GTA to IDD validate the effectiveness of our approach, which achieves a significant improvement over existing methods. Our code will be publicly available.
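
The label-level alignment described in the abstract can be pictured as a lookup from each source class to a set of candidate target classes. The following is a minimal sketch, not the authors' implementation: `TAXONOMY_MAP` and `candidate_targets` are hypothetical names, only the `vehicle` entry comes from the abstract, and the `road` entry is an invented illustration.

```python
# Hypothetical source-to-target taxonomy mapping. In the paper, such mappings
# are constructed with a language model; here they are written by hand.
TAXONOMY_MAP = {
    "vehicle": ["car", "truck", "bus"],             # example from the abstract
    "road": ["road", "lane marking", "crosswalk"],  # invented for illustration
}


def candidate_targets(source_label, mapping):
    """Return the candidate target classes for a source class.

    Classes without an entry fall back to themselves, meaning the label is
    assumed to be shared by both taxonomies (an assumption of this sketch).
    """
    return mapping.get(source_label, [source_label])


print(candidate_targets("vehicle", TAXONOMY_MAP))
print(candidate_targets("sky", TAXONOMY_MAP))
```

Restricting each region to the candidates of its predicted source class keeps the later reassignment step from choosing among the whole target label space.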

Paper Structure

This paper contains 25 sections, 7 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: DynAlign and taxonomy adaptation. Current UDA methods focus solely on domain-specific knowledge transfer and assume consistent class labels across domains, limiting their flexibility in adapting to different taxonomies. Open-vocabulary segmentation models excel with broader taxonomies through large-scale pretraining but lack the precision of domain-specific models for specialized tasks. In contrast, DynAlign integrates with any UDA model and flexibly adapts to diverse taxonomies and scene contexts, leveraging the prior knowledge of foundation models.
  • Figure 2: DynAlign overview. DynAlign integrates with any UDA model, leveraging its domain-specific knowledge and enhancing it with prior knowledge from foundation models. DynAlign starts with coarse UDA model predictions, followed by two steps: 1) an LLM constructs taxonomy mappings to align the source and target domains; 2) SAM generates fine-grained masks. CLIP then fuses the visual knowledge from SAM with the semantic knowledge from the LLM to reassign accurate labels. The CLIP-fused predictions can be used as pseudo-labels to further fine-tune the UDA model.
  • Figure 3: Foundation models and knowledge fusion. The fine-grained mask proposals from SAM are encoded into multi-scale visual features using CLIP's vision encoder, while the enriched target domain taxonomies from the LLM are encoded as context-aware text features via CLIP's text encoder. The similarity between these visual and text embeddings is then calculated to reassign semantic labels accurately to the fine-grained masks in the target domain. Here, $\Phi^V(\cdot)$ and $\Phi^T(\cdot)$ denote the CLIP vision and text encoders, respectively. $F_l$ and ${\mathbb{F}}_g$ represent the local and global features, while $F_V$ denotes their weighted sum, forming the final multi-scale visual feature that represents the mask region. ${\mathbb{F}}_T$ refers to the extracted text feature set of candidate classes.
  • Figure 4: Performance comparison between direct inference and pseudo-label training using DynAlign on the Mapillary Vistas dataset.
  • Figure 5: Qualitative comparisons on the Mapillary Vistas dataset. DynAlign effectively segments new and fine-grained classes on the target domain, showing strong taxonomy adaptation capabilities.
  • ...and 1 more figure
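
The fusion step in the Figure 3 caption (a weighted sum of local and global visual features, then similarity against candidate text features) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's formulation: the weight `alpha`, the averaging over global scales, the use of cosine similarity with an argmax, and the toy 3-dimensional vectors standing in for CLIP embeddings are all choices made for this sketch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(local_feat, global_feats, alpha=0.5):
    """Weighted sum of the local feature and the mean of the global features.

    Both alpha and the averaging over scales are assumptions of this sketch.
    """
    n = len(global_feats)
    global_mean = [sum(col) / n for col in zip(*global_feats)]
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local_feat, global_mean)]


def reassign_label(visual_feat, text_feats):
    """Return the candidate class with the highest cosine similarity, plus all scores."""
    v = l2_normalize(visual_feat)
    scores = {
        name: sum(a * b for a, b in zip(v, l2_normalize(t)))
        for name, t in text_feats.items()
    }
    return max(scores, key=scores.get), scores


# Toy stand-ins for CLIP embeddings (made-up numbers, not real features).
local = [1.0, 0.0, 0.0]
global_scales = [[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]]
candidates = {"car": [1.0, 0.0, 0.0], "truck": [0.0, 1.0, 0.0], "bus": [0.0, 0.0, 1.0]}

fused = fuse_features(local, global_scales)
label, scores = reassign_label(fused, candidates)
print(label)  # prints "car"
```

In a real pipeline the vectors would come from CLIP's vision and text encoders, and the candidate set would be limited to the target classes mapped from the region's predicted source class.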