Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs

Jeongkee Lim, Yusung Kim

TL;DR

This work tackles unsupervised domain adaptation for semantic segmentation under inconsistent taxonomies, including open and coarse-to-fine scenarios. It introduces CSI, a framework that leverages Vision Language Models (OWL-ViT and CLIP) to perform zero-shot relabeling of target-domain classes not present in the source, by building a From-To map, extracting and filtering patches, and pasting relabeled regions into pseudo labels. The approach integrates with existing UDA methods (e.g., MIC, DAFormer) and demonstrates substantial improvements in mIoU on benchmarks like Synthia→Cityscapes, including better handling of target-only and newly split classes. CSI demonstrates broad compatibility with different domain configurations and highlights the practical impact of combining strong segmentation reasoning with open-vocabulary semantic knowledge for robust cross-domain adaptation. The authors provide code and discuss limitations and future directions in supplemental material.
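The relabeling steps named above (From-To map, patch extraction, filtering, pasting into pseudo labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the class ids, the `from_to` contents, the `min_conf` threshold, and the `toy_classifier` function are all hypothetical stand-ins. In CSI the classification role would be played by VLMs such as OWL-ViT and CLIP, while here a brightness rule is used so the example runs without any model.

```python
import numpy as np

def relabel_pseudo_label(image, pseudo_label, from_to, classify_patch, min_conf=0.5):
    """Return a copy of pseudo_label with regions relabeled via a From-To map.

    from_to maps a predicted (source-taxonomy) class id to the candidate
    target-taxonomy class ids it may be relabeled to. classify_patch is a
    stand-in for a zero-shot VLM classifier (an assumption of this sketch).
    """
    relabeled = pseudo_label.copy()
    for from_id, to_ids in from_to.items():
        mask = pseudo_label == from_id
        if not mask.any():
            continue
        # Extract the patch: bounding box around all pixels of this class.
        ys, xs = np.nonzero(mask)
        patch = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        new_id, conf = classify_patch(patch, to_ids)
        # Filter: keep only confident predictions that change the label.
        if conf >= min_conf and new_id != from_id:
            # Paste: overwrite the class region in the pseudo label.
            relabeled[mask] = new_id
    return relabeled

def toy_classifier(patch, candidate_ids):
    """Hypothetical classifier: bright patches get the last candidate id."""
    score = float(patch.mean()) / 255.0
    if score > 0.5:
        return candidate_ids[-1], score
    return candidate_ids[0], 1.0 - score

# Hypothetical ids: 0 = road, 8 = vegetation, 9 = terrain (target-only class).
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:, :2] = 200                      # bright left half
pseudo_label = np.zeros((4, 4), dtype=np.int64)
pseudo_label[:, :2] = 8                 # left half predicted as vegetation
from_to = {8: [8, 9]}                   # vegetation may stay or become terrain

result = relabel_pseudo_label(image, pseudo_label, from_to, toy_classifier)
print(result)
```

In this toy run the left half of the pseudo label is relabeled from the hypothetical vegetation id to the hypothetical terrain id, and the road region is left untouched because it has no entry in the From-To map.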

Abstract

The challenge of semantic segmentation in Unsupervised Domain Adaptation (UDA) emerges not only from domain shifts between source and target images but also from discrepancies in class taxonomies across domains. Traditional UDA research assumes a consistent taxonomy between the source and target domains, which limits its ability to recognize and adapt to the taxonomy of the target domain. This paper introduces a novel approach, Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using Vision Language Models (CSI), which effectively performs domain-adaptive semantic segmentation even when source and target classes do not match. CSI leverages the semantic generalization potential of Vision Language Models (VLMs) to create synergy with previous UDA methods. It combines the segment reasoning obtained through traditional UDA methods with the rich semantic knowledge embedded in VLMs to relabel new classes in the target domain. This approach allows for effective adaptation to extended taxonomies without requiring any ground truth labels for the target domain. Our method has been shown to be effective across various benchmarks under inconsistent taxonomy settings (coarse-to-fine taxonomy and open taxonomy) and demonstrates consistent synergy effects when integrated with previous state-of-the-art UDA methods. The implementation is available at http://github.com/jkee58/CSI.

Paper Structure (36 sections, 4 equations, 7 figures, 9 tables, 1 algorithm)


Figures (7)

  • Figure 1: UDA consists of supervised learning on the source domain and unsupervised learning on the target domain. (a) The traditional UDA model learns all reasoning about the target domain from the teacher model. (b) Our CSI leverages VLMs to learn semantic information that does not exist in the target domain and combines it with known segment reasoning.
  • Figure 2: The first row shows the classes of the source domain and the last row shows the classes of the target domain. (a) is an example of a consistent taxonomy, and (b)-(c) are the inconsistent taxonomies covered by our CSI. The figure is redrawn from TACS.
  • Figure 3: Overview of CSI combined with UDA. The blue line is typically the process of training the model on the source domain. The gray line is the process of adapting the model to the target domain. The red line is the process of adapting the model to the target domain with our proposed method.
  • Figure 4: Qualitative comparison of CSI with previous methods on Synthia-to-Cityscapes. CSI shows better segmentation performance for classes not in the source domain.
  • Figure 5: Performance on Synthia-to-Cityscapes depending on when relabeling starts. The performance improvement shrinks the later relabeling begins.
  • ...and 2 more figures