A Simple Recipe for Language-guided Domain Generalized Segmentation
Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette
TL;DR
This work targets domain-generalized semantic segmentation (DGSS) by leveraging CLIP pretraining and language as a source of randomization. The proposed FAMix recipe freezes most of the backbone, augments local feature styles using language-guided prompts via Prompt-driven Instance Normalization (PIN), and mixes source and mined styles at patch level through AdaIN-based style transfer. Local style mining builds class-specific style banks from language prompts, enabling diverse intermediate domains during training without extra images. Across synthetic and real-domain benchmarks, FAMix achieves state-of-the-art DGSS performance, outperforms decoder-probing fine-tuning approaches, and demonstrates the importance of language-driven augmentation over random noise. The approach provides a practical baseline that harnesses large-scale vision-language models for robust perception under domain shifts, with open-source code available.
Abstract
Generalization to new domains not seen during training is one of the long-standing challenges in deploying neural networks in real-world applications. Existing generalization techniques require external images for augmentation and/or aim to learn invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities, along with the potential of binding different modalities. For instance, the advent of vision-language models like CLIP has opened the door for vision models to exploit the textual modality. In this paper, we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: (i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning, (ii) language-driven local style augmentation, and (iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks. Code is accessible at https://github.com/astra-vision/FAMix.
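To make ingredient (iii) more concrete, the following is a minimal NumPy sketch of patch-wise style mixing with AdaIN, not the released FAMix implementation. AdaIN itself is the standard operation that replaces the per-channel mean and standard deviation of a feature map. Everything else is an assumption made for illustration: the function names (`channel_stats`, `adain`, `mix_patch_styles`), the regular patch grid, the `(grid_h, grid_w, C)` layout of the target statistics, and the per-channel convex combination with a uniformly sampled weight `alpha`. How the target statistics are mined from prompts and picked from class-specific banks is left out; they are simply passed in.

```python
import numpy as np


def channel_stats(feat, eps=1e-5):
    """Per-channel mean and std of a (C, H, W) feature map or patch."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = np.sqrt(feat.var(axis=(1, 2), keepdims=True) + eps)
    return mu, sigma


def adain(feat, mu_new, sigma_new, eps=1e-5):
    """AdaIN: normalize each channel, then apply the new mean and std."""
    mu, sigma = channel_stats(feat, eps)
    return sigma_new * (feat - mu) / sigma + mu_new


def mix_patch_styles(feat, style_mu, style_sigma, grid=(2, 2), rng=None):
    """Restyle each patch with a random blend of its own and a target style.

    feat:        (C, H, W) source feature map.
    style_mu:    (grid_h, grid_w, C) target means, one style per patch
                 (hypothetical layout chosen for this sketch).
    style_sigma: (grid_h, grid_w, C) target standard deviations.
    """
    rng = np.random.default_rng() if rng is None else rng
    channels, height, width = feat.shape
    grid_h, grid_w = grid
    patch_h, patch_w = height // grid_h, width // grid_w
    out = feat.copy()
    for i in range(grid_h):
        for j in range(grid_w):
            # The last row/column of patches absorbs any remainder.
            ys = slice(i * patch_h, height if i == grid_h - 1 else (i + 1) * patch_h)
            xs = slice(j * patch_w, width if j == grid_w - 1 else (j + 1) * patch_w)
            patch = feat[:, ys, xs]
            mu_src, sigma_src = channel_stats(patch)
            mu_tgt = style_mu[i, j].reshape(channels, 1, 1)
            sigma_tgt = style_sigma[i, j].reshape(channels, 1, 1)
            # Assumed mixing rule: per-channel convex combination.
            alpha = rng.uniform(size=(channels, 1, 1))
            mu_mix = alpha * mu_src + (1.0 - alpha) * mu_tgt
            sigma_mix = alpha * sigma_src + (1.0 - alpha) * sigma_tgt
            out[:, ys, xs] = adain(patch, mu_mix, sigma_mix)
    return out
```

In this sketch, a weight of `alpha = 1` leaves a patch's statistics unchanged and `alpha = 0` fully adopts the target style, so sampling `alpha` anew at every training step yields a range of intermediate styles between the two.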
