VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation
Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescos
TL;DR
This work tackles open vocabulary semantic segmentation (OVSS) under domain shift by combining Vision-Language Models (VLMs) with Unsupervised Domain Adaptation (UDA). It introduces FROVSS, a decoder-augmented OVSS architecture with robust text embeddings and prompt augmentation, and pairs it with a teacher-student UDA framework that uses image Mix-Up to exploit unlabeled target data while preserving open vocabulary reasoning. Key contributions include multi-scale context, a cost-volume-based decoder, robust per-prompt text embeddings, and a UDA extension (UDA-FROVSS) that recognizes unseen categories across domains without requiring shared labels. The authors report state-of-the-art results across multiple datasets, notably surpassing prior UDA methods in the challenging Synthia-to-Cityscapes setting, and strong transferability when annotated data is limited, which matters for real-world cross-domain segmentation.
Abstract
Segmentation models are typically constrained by the categories defined during training. To address this, researchers have explored two independent approaches: adapting Vision-Language Models (VLMs) and leveraging synthetic data. However, VLMs often struggle with granularity, failing to disentangle fine-grained concepts, while synthetic data-based methods remain limited by the scope of available datasets. This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA). First, we improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary Semantic Segmentation (FROVSS) framework. Next, we incorporate these enhancements into a UDA framework by employing distillation to stabilize training and cross-domain mixed sampling to boost adaptability without compromising generalization. The resulting UDA-FROVSS framework is the first UDA approach to effectively adapt across domains without requiring shared categories.
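The two UDA ingredients named in the abstract, distillation from a teacher and cross-domain mixed sampling, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the teacher is an exponential moving average (EMA) of the student, that mixing is a Mix-Up-style convex blend of a labeled source image and an unlabeled target image, and that the target side is supervised with the teacher's soft predictions. The function names, the `alpha` and `lam` parameters, and the dictionary-of-arrays weight format are all hypothetical.

```python
import numpy as np


def ema_update(teacher, student, alpha=0.99):
    """Move each teacher weight toward the student weight (EMA).

    Assumption: weights are stored as a dict of NumPy arrays with
    identical keys and shapes in teacher and student.
    """
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}


def mixup_images(source_img, target_img, lam):
    """Convex blend of a source image and a target image (Mix-Up style)."""
    return lam * source_img + (1.0 - lam) * target_img


def mixup_labels(source_onehot, teacher_probs, lam):
    """Blend source ground truth with teacher pseudo-labels on the target.

    Both inputs have shape (H, W, C) and sum to 1 over the class axis,
    so the blended label is still a valid per-pixel distribution.
    """
    return lam * source_onehot + (1.0 - lam) * teacher_probs


# Toy example: 2x2 images, 3 classes (all values are illustrative).
rng = np.random.default_rng(0)
source_img = rng.random((2, 2, 3))
target_img = rng.random((2, 2, 3))

source_onehot = np.eye(3)[np.array([[0, 1], [2, 0]])]   # (2, 2, 3) one-hot labels
teacher_logits = rng.random((2, 2, 3))
teacher_probs = np.exp(teacher_logits)
teacher_probs /= teacher_probs.sum(axis=-1, keepdims=True)  # softmax over classes

lam = 0.7
mixed_img = mixup_images(source_img, target_img, lam)
mixed_lbl = mixup_labels(source_onehot, teacher_probs, lam)

# After a (hypothetical) student gradient step, refresh the teacher.
teacher_w = {"w": np.zeros(2)}
student_w = {"w": np.ones(2)}
teacher_w = ema_update(teacher_w, student_w, alpha=0.9)
```

In this sketch the student would be trained on `mixed_img` against `mixed_lbl`, while the teacher only changes through `ema_update`. A slowly moving teacher is one common way for distillation to stabilize pseudo-labels during adaptation.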
