VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation

Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescos

TL;DR

This work tackles open vocabulary semantic segmentation (OVSS) under domain shift by uniting Vision-Language Models with Unsupervised Domain Adaptation. It introduces FROVSS, a decoder-augmented OVSS architecture with robust text embeddings and prompt augmentation, and integrates a teacher-student UDA framework with image Mix-Up to leverage unlabeled target data while preserving open vocabulary reasoning. Key contributions include multi-scale context, a cost-volume-based decoder, robust per-prompt text embeddings, and a UDA extension (UDA-FROVSS) that enables cross-domain recognition of unseen categories without shared labels. The approach delivers state-of-the-art results across multiple datasets, notably surpassing prior UDA methods in the challenging Synthia-to-Cityscapes setting, and transfers well with limited annotated data, which matters for real-world, cross-domain semantic segmentation.

Abstract

Segmentation models are typically constrained by the categories defined during training. To address this, researchers have explored two independent approaches: adapting Vision-Language Models (VLMs) and leveraging synthetic data. However, VLMs often struggle with granularity, failing to disentangle fine-grained concepts, while synthetic data-based methods remain limited by the scope of available datasets. This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA). First, we improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary Semantic Segmentation (FROVSS) framework. Next, we incorporate these enhancements into a UDA framework by employing distillation to stabilize training and cross-domain mixed sampling to boost adaptability without compromising generalization. The resulting UDA-FROVSS framework is the first UDA approach to effectively adapt across domains without requiring shared categories.
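The two UDA ingredients named in the abstract, distillation through a teacher-student pair and cross-domain mixed sampling, can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names `ema_update` and `mix_images` are hypothetical, the teacher is assumed to track the student through an exponential moving average, and mixing is assumed to paste source pixels onto a target image with a binary mask. Weights and images are plain Python lists to keep the example self-contained.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the student weight.

    Assumption: the teacher is an exponential moving average of the student,
    a common choice in teacher-student self-training.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def mix_images(source, target, mask):
    """Build a mixed image: mask value 1 keeps the source pixel, 0 keeps the target pixel.

    Assumption: a binary-mask mix; the same mask would be applied to the
    source labels and the teacher's pseudo-labels for the target image.
    """
    return [
        [s if m else t for s, t, m in zip(src_row, tgt_row, mask_row)]
        for src_row, tgt_row, mask_row in zip(source, target, mask)
    ]


# Hypothetical toy data: two weights and a pair of 2x2 single-channel images.
teacher_weights = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
mixed = mix_images([[1, 1], [1, 1]], [[0, 0], [0, 0]], [[1, 0], [0, 1]])
```

In such a setup the student would be trained on the mixed images while the slowly moving teacher supplies pseudo-labels for the target pixels, which is one way distillation can stabilize training.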

Paper Structure

This paper contains 37 sections, 12 equations, 15 figures, 13 tables.

Figures (15)

  • Figure 1: State-of-the-art open vocabulary semantic segmentation underperforms when trained on small training sets. Results of CAT-Seg [cho2023catseg] trained on random subsets of different sizes drawn from three popular datasets, across three random seeds (the shaded area depicts the minimum-maximum range). Performance is evaluated on the COCO validation set [coco].
  • Figure 2: Visual summary of contributions. Figure 2a shows the benefits of FROVSS in the standard OVSS setup (trained on the COCO dataset and evaluated on multiple datasets). Figures 2b and 2c illustrate the major challenge we tackle: training on task-specific datasets (Cityscapes in 2b and ADE in 2c) drastically reduces the generalization of the model. To overcome this issue, our proposed combination of UDA and OVSS (UDA-FROVSS) achieves high performance on task-specific datasets while preserving generalization across other datasets (see UDA-FROVSS in 2b and 2c). Note that these UDA models do not require labels for the task-specific dataset.
  • Figure 3: Proposed decoder for open vocabulary semantic segmentation, exemplified with the category: "car". We guide segmentation by refining the similarities between dense features extracted from the image encoder and the text features ($C$).
  • Figure 4: Prompt augmentation pipeline.
  • Figure 5: Overview of UDA-FROVSS, which combines VLMs with UDA. Key components are illustrated within delineated boxes: (1) integration of a custom decoder alongside a fine-tuning strategy to effectively train the framework; (2) adaptation of UDA techniques, incorporating a teacher-student framework and image mixup for domain robustness; (3) generation of robust text embeddings for enhanced category recognition.
  • ...and 10 more figures
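
The Figure 3 caption describes guiding segmentation with the similarities between dense image features and text features ($C$). A minimal sketch of such a similarity (cost) volume follows, assuming cosine similarity between each pixel feature and each category embedding. The helper names `cost_volume` and `argmax_labels` are hypothetical, and the per-pixel argmax only illustrates how the raw volume could be read; the decoder described in the caption refines the similarities rather than using them directly.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cost_volume(pixel_features, text_embeddings):
    """Cosine similarity between every pixel feature and every text embedding.

    pixel_features: H x W x D nested lists (dense image features).
    text_embeddings: K x D nested lists (one embedding per category).
    Returns an H x W x K nested list.
    """
    text_unit = [l2_normalize(t) for t in text_embeddings]
    volume = []
    for row in pixel_features:
        out_row = []
        for feat in row:
            f = l2_normalize(feat)
            out_row.append([sum(a * b for a, b in zip(f, t)) for t in text_unit])
        volume.append(out_row)
    return volume


def argmax_labels(volume):
    """Pick the best-matching category index per pixel (illustrative read-out only)."""
    return [[max(range(len(sims)), key=sims.__getitem__) for sims in row] for row in volume]


# Hypothetical toy input: a 1x2 feature map with 2-D features and two categories.
features = [[[1.0, 0.0], [0.0, 2.0]]]
categories = [[1.0, 0.0], [0.0, 1.0]]
C = cost_volume(features, categories)
labels = argmax_labels(C)
```

Because both sides are normalized, the volume depends only on feature direction, so the second pixel above matches the second category even though its feature has a larger magnitude.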