Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning

Christoph Hümmer, Manuel Schwonberg, Liangwei Zhou, Hu Cao, Alois Knoll, Hanno Gottschalk

TL;DR

This work shows that simply fine-tuning vision-language pre-trained encoders (e.g., CLIP or EVA-CLIP) with standard decoders delivers competitive or superior domain generalization for dense perception tasks, without extra domain-generalization modules. Using Mask2Former for segmentation and ViTDet for detection, the authors demonstrate strong generalization on synthetic-to-real and real-to-real benchmarks, reaching 77.9% mIoU on Cityscapes→ACDC and 86.4% mIoU on the in-domain Cityscapes test set. The key contribution is a strong, practical baseline (VLTSeg for segmentation and VLTDet for detection) that uses large-scale image-text pre-training to surpass traditional ImageNet-based transfer learning in domain generalization. The findings argue for vision-language pre-training as a simple and robust way to improve DG in dense perception, which matters for real-world deployment where domain shifts are common.

Abstract

Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNNs), where domain shifts occur due to synthetic data, lighting, weather, or location changes. Vision-language models (VLMs) marked a large step forward in generalization capabilities and have already been applied to various tasks. Very recently, the first approaches utilized VLMs for domain-generalized segmentation and object detection and obtained strong generalization. However, all these approaches rely on complex modules, feature augmentation frameworks, or additional models. Surprisingly, and in contrast to that, we found that simple fine-tuning of vision-language pre-trained models yields competitive or even stronger generalization results while being extremely simple to apply. Moreover, we found that vision-language pre-training consistently provides better generalization than the previous standard of vision-only pre-training. This challenges the standard of using ImageNet-based transfer learning for domain generalization. Fully fine-tuning a vision-language pre-trained model is capable of reaching the domain generalization SOTA when training on the synthetic GTA5 dataset. Moreover, we confirm this observation for object detection on a novel synthetic-to-real benchmark. We further obtain superior generalization capabilities by reaching 77.9% mIoU on the popular Cityscapes-to-ACDC benchmark. We also found improved in-domain generalization, leading to an improved SOTA of 86.4% mIoU on the Cityscapes test set, which marks first place on the leaderboard.
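
The results above are reported in mIoU (mean intersection-over-union). As a reading aid, the following is a minimal sketch of how that metric is commonly computed from flattened label maps. The function name `mean_iou`, the `ignore_index` value of 255, and the toy labels are illustrative assumptions, not taken from the paper or from any benchmark's official evaluation code.

```python
from typing import List


def mean_iou(pred: List[int], target: List[int], num_classes: int,
             ignore_index: int = 255) -> float:
    """Mean IoU over all classes that occur in the prediction or the target.

    `pred` and `target` are flattened per-pixel class labels. Pixels whose
    target equals `ignore_index` are skipped (an assumed convention here).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Pixel is in both the predicted and the true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Pixel adds to the union of the predicted class and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with three classes; the last pixel is ignored.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In this sketch the counts are accumulated over whatever pixels are passed in, so a dataset-level score would be obtained by passing the labels of all evaluation images together rather than averaging per-image scores.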

Paper Structure

This paper contains 28 sections, 1 theorem, 2 equations, 6 figures, 11 tables.

Key Result

Lemma

Under the above assumptions, the encoder-decoder network $\mathbf{M}=\mathbf{M}^V_D\circ \mathbf{M}_E^V$ is domain robust, i.e., it produces the same output for $\mathbf{x}$ and $\phi(\mathbf{x})$ with probability at least $p$.
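
To make the statement concrete, the sketch below estimates empirically how often a model produces the same output for an input $\mathbf{x}$ and its domain-shifted version $\phi(\mathbf{x})$. It is only an illustration: the function `empirical_robustness`, the threshold "model", and the additive "shift" are hypothetical stand-ins, and the code neither checks the lemma's assumptions nor reproduces the paper's proof.

```python
from typing import Callable, Sequence, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def empirical_robustness(model: Callable[[X], Y],
                         shift: Callable[[X], X],
                         samples: Sequence[X]) -> float:
    """Fraction of samples x for which model(x) == model(shift(x))."""
    if not samples:
        raise ValueError("need at least one sample")
    agree = sum(1 for x in samples if model(x) == model(shift(x)))
    return agree / len(samples)


# Hypothetical toy setup: a threshold classifier on scalar "images"
# and a domain shift that adds a constant brightness offset.
def toy_model(x: float) -> int:
    return int(x > 0.5)


def toy_shift(x: float) -> float:
    return x + 0.1


samples = [0.0, 0.2, 0.45, 0.6, 0.9]
estimate = empirical_robustness(toy_model, toy_shift, samples)
print(f"estimated agreement rate: {estimate}")
```

The returned agreement rate is an empirical counterpart of the probability $p$ in the lemma, measured on a finite sample rather than guaranteed.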

Figures (6)

  • Figure 1: CLIP-Based Fine-tuning for Domain Generalized Dense Perception: Previous work in domain generalization research mostly focuses on applying additional methods (e.g., augmentations, consistency losses) during downstream training. In contrast, we simply transfer CLIP-based pre-trained weights (1), fine-tune on two different tasks (2), and evaluate the performance across various unseen real-world domains (3). In this way, we reach SOTA performance in domain generalized semantic segmentation and object detection.
  • Figure 2:
  • Figure 3:
  • Figure 4: Predictions on the Cityscapes test set $\mathcal{D}^{\mathrm{CS}}_{\mathrm{test}}$. Training and evaluation were conducted as described in the section labeled sec:testset_eval.
  • Figure 5: Predictions on the ACDC val set $\mathcal{D}^{\mathrm{ACDC}}_{\mathrm{val}}$. Training on $\mathcal{D}^{\mathrm{CS}}_{\mathrm{train}}$ and evaluation were conducted as described in the section labeled sec:testset_eval. Best viewed digitally for the predictions.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Lemma
  • Proof