Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

Zhixiang Wei; Lin Chen; Yi Jin; Xiaoxiao Ma; Tianle Liu; Pengyang Ling; Ben Wang; Huaian Chen; Jinjin Zheng

TL;DR

The paper addresses Domain Generalized Semantic Segmentation (DGSS) by leveraging Vision Foundation Models (VFMs) and introducing Rein, a parameter-efficient fine-tuning module. Rein places learnable, low-rank tokens between backbone layers to generate instance-aware feature refinements, enabling strong generalization with only a small fraction of trainable parameters. Across multiple VFMs and DGSS benchmarks, Rein consistently outperforms state-of-the-art methods. For example, with only about $1\%$ extra trainable parameters in the backbone, it reaches $78.4\%$ mIoU on the Cityscapes validation set in a synthetic-to-real setting, rising to $82.5\%$ when $1/16$ of the Cityscapes training data is used. Rein acts as a plug-in adapter compatible with plain vision transformers, and the results indicate that VFMs serve as stronger DGSS backbones when refined with Rein.

Abstract

In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Motivated by the idea of leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines and forwards the feature maps from each layer to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full-parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves an mIoU of 78.4% on Cityscapes without accessing any real urban-scene datasets. Code is available at https://github.com/w1oves/Rein.git.
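The token-based refinement described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the softmax similarity between features and tokens, the projection matrices `W_t` and `W_f`, the zero initialization of `W_f`, and all shapes are assumptions chosen for illustration. Only the low-rank tokens and the residual form $f'_i=f_i+Rein(f_i)$ come from the text on this page.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rein_refine(f, A, B, W_t, W_f):
    """Refine one layer's features with low-rank tokens (illustrative only).

    f:   (n, c) patch features from a frozen backbone layer
    A:   (m, r) and B: (r, c) low-rank factors of the m tokens
    W_t: (c, c) hypothetical projection applied to the tokens
    W_f: (c, c) hypothetical projection applied to the refinement
    """
    T = A @ B                                    # (m, c) tokens with rank <= r
    S = softmax(f @ T.T / np.sqrt(f.shape[1]))   # (n, m) feature-token similarity
    delta = S @ (T @ W_t)                        # (n, c) per-patch refinement
    return f + delta @ W_f                       # residual: f' = f + Rein(f)

rng = np.random.default_rng(0)
n, c, m, r = 6, 8, 4, 2                          # toy sizes, not the paper's
f = rng.normal(size=(n, c))
A = rng.normal(size=(m, r))
B = rng.normal(size=(r, c))
W_t = 0.1 * rng.normal(size=(c, c))
W_f = np.zeros((c, c))                           # assumed zero init: no change at start

out = rein_refine(f, A, B, W_t, W_f)
low_rank_params = A.size + B.size                # m*r + r*c
full_rank_params = m * c                         # parameters of an unfactored token matrix
```

With `W_f` set to zero the module leaves the frozen features untouched, so training would start from the pre-trained model's behavior. At these toy sizes the factorization saves only a few parameters (`m*r + r*c` versus `m*c`); the saving grows once the token count and channel width are much larger than the rank.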

Paper Structure (20 sections, 13 equations, 8 figures, 16 tables, 1 algorithm)

Introduction
Related Works
Methods
Preliminary
Core of Rein
Details of Rein
Experiments
Settings
Comparison with State-of-The-Art Methods
Ablation Studies and Analysis
Conclusions
Acknowledgements
Fewer Trainable Parameters
Value of synthetic data
Ablation on decode head
...and 5 more sections

Figures (8)

Figure 1: Vision Foundation Models (VFMs) are stronger pre-trained models that serve as robust backbones, effortlessly outperforming previous state-of-the-art Domain Generalized Semantic Segmentation (DGSS) methods, as shown in (a). Yet, the extensive parameters of VFMs make them challenging to train. To address this, we introduce a robust fine-tuning approach to efficiently harness VFMs for DGSS. As illustrated in (b) and (c), the proposed method achieves superior generalizability with fewer trainable parameters within the backbone.
Figure 2: An overview of the proposed Rein. Rein primarily consists of a collection of low-rank learnable tokens, denoted as $T=\{T_1,T_2,\ldots,T_N\}$. These tokens establish direct connections to distinct instances, facilitating instance-level feature refinement. This mechanism generates an enhanced feature map $f'_i=f_i+Rein(f_i)$ for each layer within the backbone. All MLP parameters are shared across layers to reduce the number of parameters. $M_f$, $M_Q$, and $M_S$ are the features module, queries module, and segmentation module, respectively. The notation $max~\&~avg~\&~last$ refers to Eq. (\ref{eq:link}) and Eq. (\ref{eq:link2}).
Figure 3: Qualitative Comparison under GTAV $\rightarrow$ Cityscapes (Citys) + BDD100K (BDD) + Mapillary (Map) generalization setting.
Figure 4: Ablation study on token length $m$.
Figure 5: The curves of training loss and test metrics display consistent trends across different VFMs and decode heads: as trainable parameters increase from $0.00\text{M (Freeze)} \rightarrow 2.53\text{M (Rein)} \rightarrow 304.24\text{M (Full)}$, the training loss monotonically decreases, indicating that a greater number of trainable parameters indeed fits the training dataset better. However, the test metrics on the target datasets initially rise and then fall, forming an inverted U-shape. This pattern suggests that the "Full" baseline overfits the training data, leading to diminished test performance. These findings align with our motivation of leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability. The blue bar charts in the figure represent the average mIoU tested on the Cityscapes, BDD100K, and Mapillary datasets, while the yellow line denotes the training loss during fine-tuning on the GTAV dataset.
...and 3 more figures
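As a quick arithmetic check on the parameter counts quoted in the Figure 5 caption, the snippet below computes the share of parameters that Rein trains relative to full fine-tuning. It assumes that 304.24M is the total number of backbone parameters updated by the "Full" baseline; the dictionary and function names are illustrative.

```python
# Trainable backbone parameter counts (in millions) from the Figure 5 caption.
trainable_params_m = {"Freeze": 0.00, "Rein": 2.53, "Full": 304.24}

def trainable_fraction(setting, reference="Full"):
    # Ratio of trainable parameters in `setting` to those in `reference`.
    return trainable_params_m[setting] / trainable_params_m[reference]

rein_pct = 100 * trainable_fraction("Rein")
print(f"Rein trains about {rein_pct:.2f}% of what full fine-tuning updates")
# prints: Rein trains about 0.83% of what full fine-tuning updates
```

Under that assumption the result is roughly in line with the "extra 1% of trainable parameters" stated in the abstract.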
