Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

Dong Zhao; Qi Zang; Nan Pu; Wenjing Li; Nicu Sebe; Zhun Zhong

TL;DR

S2-Corr is a state-space-driven text-image correlation refinement mechanism that mitigates domain-induced distortions and produces more consistent text-image correlations under distribution changes. On the proposed OVDG-SS benchmark, it achieves better cross-domain performance and efficiency than existing OV-SS approaches.

Abstract

Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG-SS methods are restricted to a fixed set of known categories, limiting their applicability in open-world scenarios. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban-driving scenarios. To bridge this gap, we introduce Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG-SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic-to-real and real-to-real generalization across diverse unseen domains and unseen categories. In OVDG-SS, we observe that domain shifts often distort text-image correlations in pre-trained VLMs, which hinders the performance of OV-SS models. To tackle this challenge, we propose S2-Corr, a state-space-driven text-image correlation refinement mechanism that mitigates domain-induced distortions and produces more consistent text-image correlations under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross-domain performance and efficiency compared to existing OV-SS approaches.
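
To make the term "text-image correlation" concrete, here is a minimal sketch that assumes the correlation is the cosine similarity between each dense image embedding and each class text embedding, as is common in CLIP-based OV-SS pipelines. The abstract does not give the exact formula, so the function names and the nested-list layout are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def correlation_map(pixel_feats, text_feats):
    """Cosine similarity between every pixel embedding and every text embedding.

    pixel_feats: H x W x D nested lists (hypothetical dense image features).
    text_feats:  K x D nested lists (hypothetical class text embeddings).
    Returns a K x H x W nested list of similarities in [-1, 1].
    """
    texts = [l2_normalize(t) for t in text_feats]
    pixels = [[l2_normalize(p) for p in row] for row in pixel_feats]
    return [
        [[sum(a * b for a, b in zip(p, t)) for p in row] for row in pixels]
        for t in texts
    ]


# Toy usage: one row of two pixels, one class.
corr = correlation_map([[[1.0, 0.0], [0.0, 2.0]]], [[3.0, 0.0]])
```

Under this assumption, a domain shift that perturbs the image embeddings changes these similarities directly, which is one way to picture the distortion that the abstract says S2-Corr aims to correct.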

Paper Structure (29 sections, 10 equations, 12 figures, 15 tables, 1 algorithm)

Introduction
Related Work
Methodology
Revisiting and Analyzing OV-SS
Refining Correlation with State-Space Models
S$^2$-Corr
Experiments
Dataset and Evaluation
Implementation Details
Comparison with State of The Art
Ablation Study and Analysis
Conclusion
More Dataset Details
More Implementation Details
Text prompt templates.
...and 14 more sections

Figures (12)

Figure 1: Efficiency and performance comparison on OVDG-SS tasks using EVA02 ViT-B/16 as backbone. FPS is tested on images with a short edge of 480 and a long edge of 960. Our method achieves the best trade-off among generalization ability, speed, and parameter efficiency.
Figure 2: Overview of the proposed S$^2$-Corr. The upper part shows the CLIP-based encoding and correlation aggregation pipeline. The lower part illustrates our S$^2$-Corr, which refines text–image correlations using a specially designed chunked State-Space Model (SSM) aggregation scheme.
Figure 3: Effect of domain shift on text–image correlations of the class "sky" from the EVA02 model [eva02]. The color map ranges from blue (low correlation) to red (high correlation). As the domain shift increases from left to right, the initial correlation maps become progressively noisier, with incorrect activations spreading across irrelevant regions.
Figure 4: Comparison of scanning strategies in state-space correlation aggregation between VMamba [liu2024vmambavisualstatespace] and ours. Our method introduces a learnable geometric decay ($*\gamma$) to suppress long-range noise within each chunk and uses a snake-shaped scanning strategy that preserves spatial continuity by passing end states between adjacent chunks.
Figure 5: Comparison of text–image correlation aggregation on seen and unseen classes from unseen domains. Our method yields clearer and more localized text–image correlations than CAT-Seg [cho2024cat] and the initial correlation map (obtained by Eq. \ref{initial_map}), improving both seen and unseen predictions under domain shifts.
...and 7 more figures
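
The snake-shaped scan with geometric decay described in the Figure 4 caption can be sketched in simplified form. This is an illustration under stated assumptions, not the paper's S$^2$-Corr module: each grid row is treated as one chunk, the state-space update is replaced by a scalar recurrence `state = gamma * state + x`, and `gamma` is a fixed number rather than a learned parameter. The names `snake_order` and `snake_decay_scan` are hypothetical.

```python
def snake_order(height, width):
    """Yield (row, col) left-to-right on even rows and right-to-left on odd rows."""
    for r in range(height):
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        for c in cols:
            yield r, c


def snake_decay_scan(grid, gamma):
    """Accumulate grid values along the snake path with geometric decay.

    Each row acts as one chunk (an assumption). The state at the end of a row
    is carried into the next row, which starts at the spatially adjacent cell.
    With 0 < gamma < 1, a value contributes gamma**d to a cell d steps later.
    """
    height, width = len(grid), len(grid[0])
    out = [[0.0] * width for _ in range(height)]
    state = 0.0
    for r, c in snake_order(height, width):
        state = gamma * state + grid[r][c]
        out[r][c] = state
    return out


# Toy usage: a single activation at the top-left corner of a 2 x 3 grid.
result = snake_decay_scan([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], gamma=0.5)
```

In this sketch, consecutive positions on the path are always neighboring cells, which is the spatial-continuity property the caption attributes to snake scanning, and the decay shrinks the influence of activations that lie far back along the path.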
