Collaborating Foundation Models for Domain Generalized Semantic Segmentation

Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalogeiton, Stéphane Lathuilière

TL;DR

CLOUDS tackles domain generalized semantic segmentation by moving beyond style-based domain randomization to content-rich diversification through a collaborative suite of foundation models. It uses a CLIP-based encoder with a Mask2Former-style decoder, a diffusion model conditioned on LLM-generated prompts to synthesize diverse urban scenes, and SAM to refine pseudo-labels in a self-training loop with an EMA teacher. Ablations and extensive benchmarks with GTA and SYNTHIA as synthetic source datasets and real driving datasets as targets show that CLOUDS consistently outperforms traditional DGSS and open-vocabulary methods, including zero-shot ones, on averaged mIoU; the abstract reports gains over prior methods of 5.6% on synthetic-to-real benchmarks and 6.7% under varying weather conditions. The approach demonstrates the feasibility and value of integrating multiple foundation models to achieve robust, practical domain generalization in semantic segmentation.

Abstract

Domain Generalized Semantic Segmentation (DGSS) deals with training a model on a labeled source domain with the aim of generalizing to unseen domains during inference. Existing DGSS methods typically learn robust features by means of Domain Randomization (DR). Such an approach is often limited, as it can only account for style diversification and not content. In this work, we take an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In detail, CLOUDS is a framework that integrates foundation models (FMs) of various kinds: (i) a CLIP backbone for its robust feature representation, (ii) generative models to diversify the content, thereby covering various modes of the possible target distribution, and (iii) the Segment Anything Model (SAM) for iteratively refining the predictions of the segmentation model. Extensive experiments show that CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under varying weather conditions, notably outperforming prior methods by 5.6% and 6.7% on averaged mIoU, respectively. The code is available at: https://github.com/yasserben/CLOUDS

Paper Structure (29 sections, 2 equations, 10 figures, 5 tables)

This paper contains 29 sections, 2 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Performance over time by various methods on the GTA $\rightarrow$ {Cityscapes, BDD, Mapillary} benchmark. Recent open-vocabulary approaches, like FC-CLIP, are shown to excel in zero-shot learning and surpass traditional domain generalization methods trained in closed-set scenarios, challenging the relevance of the DGSS setting. CLOUDS, by harnessing multiple foundation models, demonstrates its ability to effectively utilize the source dataset, thereby outperforming both conventional DGSS and open-vocabulary methods.
  • Figure 2: Qualitative comparison at inference: (a) Input image, (b) SHADE, a traditional style diversification DGSS method, (c) SAM, a foundation model that predicts precise class-agnostic maps, (d) GroundingSAM, which leverages SAM and text prompts to output semantic maps, and (e) our proposed CLOUDS, which leverages an assembly of foundation models to predict high-quality semantic maps.
  • Figure 3: Training pipeline of CLOUDS: The model integrates a CLIP image encoder with a MaskFormer decoder (Sec. \ref{sub:clip}). Our domain randomization strategy is based on a data generalization module (Sec. \ref{sub:diffusion}) that combines a Large Language Model (LLM) with a text-to-image diffusion model to generate a varied dataset, representative of potential target datasets. This data is then employed in a Self-Training framework (Sec. \ref{sub:self_training}), where initial pseudo labels (PL) prompt the Segment Anything Model (SAM) for refined pseudo labels, thereby fortifying the decoder's robustness.
  • Figure 4: Pseudo Label refinement with SAM. We extract binary masks from the predicted segmentation. After labeling connected components and filtering noisy ones, we select random points within each binary mask. These points, along with the corresponding RGB images, are then used to prompt SAM, enabling the generation of more accurate segmentation maps.
  • Figure 5: Effect of generated dataset size on mIoU. Experiments performed on GTA $\rightarrow$ {Cityscapes, BDD, Mapillary}.
  • ...and 5 more figures
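
The point-selection step described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `connected_components` and `point_prompts`, the 4-connectivity, the area threshold `min_area`, the number of points per region, and the ignore label 255 are all hypothetical choices made for the example. The call to SAM itself is left out; the sketch only produces the per-region point prompts that would be passed to it together with the RGB image.

```python
import random
from collections import deque


def connected_components(mask):
    """Return the 4-connected regions of truthy cells in a 2D list,
    each region as a list of (row, col) pixels."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def point_prompts(pseudo_label, min_area=4, points_per_region=2,
                  ignore_label=255, seed=0):
    """Build point prompts from a predicted label map (2D list of class ids).

    For each class: extract its binary mask, label connected components,
    drop components smaller than `min_area` (assumed noise filter), and
    sample random points inside each remaining component.
    """
    rng = random.Random(seed)
    prompts = []
    classes = sorted({v for row in pseudo_label for v in row if v != ignore_label})
    for cls in classes:
        binary = [[v == cls for v in row] for row in pseudo_label]
        for pixels in connected_components(binary):
            if len(pixels) < min_area:
                continue  # treat tiny regions as noisy and skip them
            k = min(points_per_region, len(pixels))
            prompts.append({"class_id": cls, "points": rng.sample(pixels, k)})
    return prompts


if __name__ == "__main__":
    # Toy 4x4 prediction: one large class-0 region, one 2x2 class-1 region,
    # and a single isolated class-1 pixel that the area filter removes.
    toy = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 0],
        [1, 0, 0, 0],
    ]
    for prompt in point_prompts(toy):
        print(prompt)
```

In a full pipeline, each prompt's points would be given to SAM with the image, and the returned mask would be assigned the prompt's `class_id` to form the refined pseudo label; how overlapping masks are resolved is not specified in the caption and is left open here.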