What is the Added Value of UDA in the VFM Era?

Brunó B. Englert, Tommie Kerssies, Gijs Dubbelman

TL;DR

This paper asks whether Unsupervised Domain Adaptation (UDA) still adds value in the era of Vision Foundation Models (VFMs) for autonomous-driving perception, using semantic segmentation as the test case. It compares UDA against source-only fine-tuning in synth-to-real and real-to-real scenarios, varying the strength and diversity of the source data and optionally adding a small amount of labeled target data. With stronger synthetic source data, UDA's gain over source-only fine-tuning shrinks from +8 mIoU to +2 mIoU; with more diverse real source data, UDA adds no value. UDA still outperforms source-only fine-tuning in every synthetic-source scenario, and with only 1/16 of the Cityscapes labels, synthetic UDA reaches the same 85 mIoU as a fully-supervised model trained on all labels. The work suggests focusing UDA on synth-to-real settings, treating it as a fallback when a domain gap persists, and evaluating adaptation on data scenarios that are more representative of real-world autonomous driving deployments.

Abstract

Unsupervised Domain Adaptation (UDA) can improve a perception model's generalization to an unlabeled target domain starting from a labeled source domain. UDA using Vision Foundation Models (VFMs) with synthetic source data can achieve generalization performance comparable to fully-supervised learning with real target data. However, because VFMs have strong generalization from their pre-training, more straightforward, source-only fine-tuning can also perform well on the target. As data scenarios used in academic research are not necessarily representative of real-world applications, it is currently unclear (a) how UDA behaves with more representative and diverse data and (b) whether source-only fine-tuning of VFMs can perform equally well in these scenarios. Our research aims to close these gaps and, similar to previous studies, we focus on semantic segmentation as a representative perception task. We assess UDA for synth-to-real and real-to-real use cases with different source and target data combinations. We also investigate the effect of using a small amount of labeled target data in UDA. We clarify that while these scenarios are more realistic, they are not necessarily more challenging. Our results show that, when using stronger synthetic source data, UDA's improvement over source-only fine-tuning of VFMs shrinks from +8 mIoU to +2 mIoU, and when using more diverse real source data, UDA has no added value. However, UDA generalization is higher than that of source-only fine-tuning in all synthetic data scenarios and, when including only 1/16 of Cityscapes labels, synthetic UDA obtains the same state-of-the-art segmentation quality of 85 mIoU as a fully-supervised model using all labels. Considering the mixed results, we discuss how UDA can best support robust autonomous driving at scale.
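
All results above are reported in mIoU (mean Intersection over Union, in percent). As a reading aid, the following is a minimal sketch of how this metric is commonly computed from flattened per-pixel labels. It is not taken from the paper: the function names, the toy labels, and the use of 255 as the label to ignore are assumptions made for illustration.

```python
from typing import List, Sequence


def confusion_matrix(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: int = 255,  # assumption: pixels with this label are skipped
) -> List[List[int]]:
    """Count pixels per (ground-truth class, predicted class) pair.

    Rows index the ground-truth class, columns the predicted class.
    Predictions are assumed to lie in [0, num_classes).
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average per-class IoU = TP / (TP + FP + FN), returned in percent.

    Classes that never occur in either prediction or ground truth are
    left out of the average.
    """
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                 # ground truth c, predicted otherwise
        fp = sum(row[c] for row in cm) - tp  # predicted c, ground truth otherwise
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return 100.0 * sum(ious) / len(ious) if ious else 0.0


# Hypothetical toy labels for six pixels and three classes.
toy_target = [0, 0, 1, 1, 2, 255]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_cm = confusion_matrix(toy_pred, toy_target, num_classes=3)
toy_miou = mean_iou(toy_cm)
```

Under this definition, a statement such as "+2 mIoU" is the difference between two such percentages, for example the mIoU of a UDA model minus the mIoU of a source-only model on the same target validation set.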

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: UDA methods vs. source-only baselines and fully-supervised oracles. While VFM-based UDA achieves generalization close to fully-supervised learning, the added value of UDA over simple source-only fine-tuning requires further investigation.
  • Figure 2: Dataset Overview. On the top left, we show source-only performance (mIoU) evaluated on Cityscapes [cordts_cityscapes_2016] for models trained on different source datasets, including their domain gaps relative to the oracle (trained on Cityscapes), and dataset sizes (in thousands of samples). On the top right, a t-SNE visualization of DINOv2-L [dinov2_2023] [CLS] token embeddings shows a clear separation between synthetic and real datasets, while also capturing semantic similarities among them.