A Simple Recipe for Language-guided Domain Generalized Segmentation

Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

TL;DR

This work targets domain-generalized semantic segmentation (DGSS) by leveraging CLIP pretraining and language as a source of randomization. The proposed FAMix recipe freezes most of the backbone, augments local feature styles using language-guided prompts via Prompt-driven Instance Normalization (PIN), and mixes source and mined styles at patch level through AdaIN-based style transfer. Local style mining builds class-specific style banks from language prompts, enabling diverse intermediate domains during training without extra images. Across synthetic and real-domain benchmarks, FAMix achieves state-of-the-art DGSS performance, outperforms decoder-probing fine-tuning approaches, and demonstrates the importance of language-driven augmentation over random noise. The approach provides a practical baseline that harnesses large-scale vision-language models for robust perception under domain shifts, with open-source code available.

Abstract

Generalization to new domains not seen during training is one of the long-standing challenges in deploying neural networks in real-world applications. Existing generalization techniques either necessitate external images for augmentation, and/or aim at learning invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities, along with the potential of binding different modalities. For instance, the advent of vision-language models like CLIP has opened the doorway for vision models to exploit the textual modality. In this paper, we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: (i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning, (ii) language-driven local style augmentation, and (iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks. Code is accessible at https://github.com/astra-vision/FAMix .
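
Ingredient (iii), the local mixing of source and augmented styles, can be written compactly. The following is a sketch under assumptions rather than the paper's exact formulation: it takes "style" to mean the channel-wise mean and standard deviation of a feature patch, uses the standard AdaIN transfer named in the TL;DR, and assumes a mixing weight $\alpha \in [0, 1]$. The symbols $f$, $(\mu_a, \sigma_a)$, and $\alpha$ are notation chosen here for illustration.

$$
\mu_{\text{mix}} = \alpha\,\mu(f) + (1-\alpha)\,\mu_a,
\qquad
\sigma_{\text{mix}} = \alpha\,\sigma(f) + (1-\alpha)\,\sigma_a
$$

$$
\tilde{f} = \sigma_{\text{mix}}\,\frac{f - \mu(f)}{\sigma(f)} + \mu_{\text{mix}}
$$

Here $f$ is a source feature patch, $(\mu_a, \sigma_a)$ is an augmented style (for example, one mined from a language prompt), and $\tilde{f}$ is the restyled patch. Setting $\alpha = 1$ recovers the source patch unchanged, while $\alpha = 0$ transfers the augmented style entirely; intermediate values give styles between the two.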

Paper Structure (21 sections, 4 equations, 5 figures, 14 tables, 2 algorithms)

Figures (5)

  • Figure 1: Mixing strategies. (Left) MixStyle [zhou2021domain] linearly mixes the feature statistics of samples from the source domain(s) $\mathbf{S}$. (Right) We apply an augmentation $\mathcal{A}(\cdot)$ to the source-domain statistics, then linearly mix the original and augmented statistics. Intuitively, this enlarges the support of the training distribution by leveraging statistics beyond the source domain(s) and by discovering intermediate domains. $\mathcal{A}(\cdot)$ can be a language-driven or a Gaussian-noise augmentation; we show that the former leads to better generalization results.
  • Figure 2: Overall process of FAMix. FAMix consists of two steps. (Left) Local style mining divides the low-level feature activations into patches, which are used for style mining with Prompt-driven Instance Normalization (PIN) [fahes2023poda]. Specifically, for each patch, the dominant class is queried from the ground truth, and the mined style is added to the corresponding class-specific style bank. (Right) The segmentation network is trained with minimal fine-tuning of the backbone. At each iteration, the low-level feature activations are viewed as a grid of patches. For each patch, the dominant class is queried from the ground truth, then a style is sampled from the corresponding style bank. Style randomization is performed by normalizing each patch in the grid by its own statistics and transferring a new style, which is a mix of the original style and the sampled one. The network is trained using only a cross-entropy loss.
  • Figure 3: Qualitative results. Columns 1-2: Image and ground truth (GT), Columns 3-5: DGSS methods results, Column 6: Our results. The models are trained on GTAV with ResNet-50 backbone.
  • Figure 4: Ablation of prompt set and freezing strategy. (a) Performance (mIoU %) on test datasets w.r.t. the number of random style prompts in $\mathcal{R}$. (b) Effect of freezing layers; the x-axis reports the last frozen layer. For example, 'L3' means freezing L1, L2 and L3. 'L4' ' indicates that Layer4 is partially frozen.
  • Figure 5: Examples of failure cases. Columns 1-2: Image and ground truth (GT), Column 3: Baseline (Freeze ✗, Augment ✗, Mix ✗), Column 4: FAMix results. The models are trained on GTAV with ResNet-50 backbone.
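
The training-time step described in the Figure 2 caption (view features as a grid of patches, look up each patch's dominant class, sample a style from that class's bank, then mix and transfer) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names, the 3×3 grid, the `ignore_index` value of 255, the per-channel uniform mixing weight, and the dictionary layout of `style_bank` are all assumptions made here. The style bank is taken as given; mining it with PIN is not shown.

```python
import numpy as np

def patch_stats(patch, eps=1e-5):
    """Channel-wise mean and std of a (C, h, w) feature patch."""
    mu = patch.mean(axis=(1, 2), keepdims=True)
    sigma = patch.std(axis=(1, 2), keepdims=True) + eps
    return mu, sigma

def dominant_class(label_patch, ignore_index=255):
    """Most frequent valid class id in a label patch, or None if none is valid."""
    valid = label_patch[label_patch != ignore_index]
    if valid.size == 0:
        return None
    return int(np.bincount(valid).argmax())

def mix_patch_styles(feats, labels, style_bank, grid=3, rng=None):
    """Restyle each patch of `feats` with a mix of its own and a sampled style.

    feats:      (C, H, W) float array of low-level features.
    labels:     (H, W) integer array of class ids at feature resolution.
    style_bank: hypothetical layout, dict class_id -> list of (mu, sigma),
                each an array of shape (C,).
    """
    rng = np.random.default_rng() if rng is None else rng
    C, H, W = feats.shape
    out = feats.copy()
    ys = np.linspace(0, H, grid + 1, dtype=int)  # patch row boundaries
    xs = np.linspace(0, W, grid + 1, dtype=int)  # patch column boundaries
    for i in range(grid):
        for j in range(grid):
            rows, cols = slice(ys[i], ys[i + 1]), slice(xs[j], xs[j + 1])
            cls = dominant_class(labels[rows, cols])
            bank = style_bank.get(cls) if cls is not None else None
            if not bank:
                continue  # no style available: leave the patch unchanged
            patch = feats[:, rows, cols]
            mu_s, sigma_s = patch_stats(patch)
            mu_a, sigma_a = bank[rng.integers(len(bank))]
            # Assumed: one mixing weight per channel, drawn uniformly in [0, 1).
            alpha = rng.uniform(size=(C, 1, 1))
            mu_mix = alpha * mu_s + (1 - alpha) * mu_a.reshape(C, 1, 1)
            sigma_mix = alpha * sigma_s + (1 - alpha) * sigma_a.reshape(C, 1, 1)
            # AdaIN-style transfer: normalize by own stats, apply mixed stats.
            out[:, rows, cols] = sigma_mix * (patch - mu_s) / sigma_s + mu_mix
    return out
```

Because each patch is normalized by its own statistics before the mixed style is applied, the restyled patch's channel-wise mean equals the mixed mean, which lies between the source mean and the sampled one. Patches whose dominant class has no entry in the bank are passed through untouched.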