Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
Zhixiang Wei, Lin Chen, Yi Jin, Xiaoxiao Ma, Tianle Liu, Pengyang Ling, Ben Wang, Huaian Chen, Jinjin Zheng
TL;DR
The paper addresses domain generalized semantic segmentation (DGSS) by leveraging Vision Foundation Models (VFMs) and introducing Rein, a parameter-efficient refinement module. Rein places learnable, low-rank tokens between backbone layers to generate instance-aware feature adjustments, enabling strong generalization with only a small fraction of trainable parameters. Across multiple VFMs and DGSS benchmarks, Rein consistently outperforms state-of-the-art methods. For example, with only $1\%$ extra trainable parameters within the frozen backbone, it reaches an mIoU of $78.4\%$ on the Cityscapes validation set in synthetic-to-real transfer, and $82.5\%$ when $1/16$ of the Cityscapes training data is used. The approach works as a plug-in adapter compatible with plain vision transformers and shows that VFMs serve as stronger backbones when refined via Rein, advancing practical domain generalized semantic segmentation.
Abstract
In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Motivated by the idea of leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines and forwards the feature maps from each layer to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves an mIoU of 78.4% on Cityscapes, without accessing any real urban-scene datasets. Code is available at https://github.com/w1oves/Rein.git.
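The token-based refinement described in the abstract can be illustrated with a small sketch. It is not taken from the released code: the helper names (`matmul`, `softmax`, `refine`), the shapes `n`, `c`, `m`, `r`, and the scaled-softmax similarity between features and tokens are assumptions made for illustration, and any learned projections the actual method may apply are omitted. Plain Python lists stand in for tensors so that the example has no dependencies.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def refine(features, a, b):
    """Add a token-driven adjustment to every patch feature (hypothetical helper).

    features: n x c patch features produced by one frozen backbone layer
    a, b:     trainable low-rank factors (m x r and r x c); tokens = a @ b
    """
    tokens = matmul(a, b)                               # m x c learnable tokens
    c = len(features[0])
    tokens_t = [list(col) for col in zip(*tokens)]      # c x m
    scores = matmul(features, tokens_t)                 # n x m patch-token similarity
    weights = [softmax([s / math.sqrt(c) for s in row]) for row in scores]
    delta = matmul(weights, tokens)                     # n x c per-patch adjustment
    # The refined features are what would be forwarded to the next layer.
    return [[f + d for f, d in zip(f_row, d_row)]
            for f_row, d_row in zip(features, delta)]


random.seed(0)
n, c, m, r = 4, 8, 3, 2  # patches, channels, tokens, rank (toy sizes)
features = [[random.gauss(0, 1) for _ in range(c)] for _ in range(n)]
a = [[random.gauss(0, 1) for _ in range(r)] for _ in range(m)]
b_zero = [[0.0] * c for _ in range(r)]
b_rand = [[random.gauss(0, 1) for _ in range(c)] for _ in range(r)]

unchanged = refine(features, a, b_zero)  # zero tokens give a zero adjustment
adjusted = refine(features, a, b_rand)   # non-zero tokens shift the features
```

In this sketch only `a` and `b` would be trained while the backbone stays frozen. Because each patch weights the tokens differently, patches belonging to different objects receive different adjustments, which is one way to read the "diverse refinements for different categories within a single image" mentioned above. Zero-initializing `b` makes the module start as an identity mapping; that is a design choice of the sketch, not a claim about the paper.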
