The Last Mile to Supervised Performance: Semi-Supervised Domain Adaptation for Semantic Segmentation

Daniel Morales-Brotons, Grigorios Chrysos, Stratis Tzoumas, Volkan Cevher

TL;DR

This work investigates semi-supervised domain adaptation (SSDA) for semantic segmentation, addressing the challenge of reaching supervised-level performance with minimal target annotations. It proposes a simple, effective framework that combines consistency regularization, supervised pixel contrastive learning, and iterative self-training within a mean-teacher setup to leverage source labels, unlabeled target data, and a small set of target labels. The approach achieves state-of-the-art results on GTA→Cityscapes in low-label regimes, where 100–200 target labels approach or surpass fully supervised performance, and it generalizes to Synthia and BDD scenarios. It also analyzes how existing UDA and SSL methods perform in SSDA, offering design guidelines that prioritize tight clustering of target representations and domain robustness over explicit domain alignment. The findings highlight the practical value of SSDA for reducing annotation costs while delivering near-supervised performance in dense prediction tasks.
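
As a rough illustration of the mean-teacher setup mentioned above, the sketch below shows an exponential-moving-average (EMA) teacher update and a pixel-wise consistency loss in which the teacher's confident predictions act as hard pseudotargets for the student. It is a minimal sketch under stated assumptions, not the authors' implementation: parameters and logits are plain Python lists, and the names `ema_update` and `consistency_loss`, the decay of 0.99, and the FixMatch-style confidence threshold of 0.9 are hypothetical choices for illustration. Augmentations and backpropagation are omitted.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Mean-teacher update: teacher <- decay * teacher + (1 - decay) * student.

    Hypothetical helper: parameters are flat lists of floats here; a real
    model would loop over its parameter tensors. decay=0.99 is an assumption.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def consistency_loss(student_logits, teacher_logits, threshold=0.9):
    """Pixel-wise consistency regularization (hypothetical sketch).

    Each input is a list of per-pixel logit vectors. The teacher output is
    treated as a constant (stop-gradient); its argmax becomes a hard
    pseudotarget, and pixels below the assumed confidence threshold are
    ignored. Returns the mean cross-entropy over the kept pixels.
    """
    total, kept = 0.0, 0
    for s_logits, t_logits in zip(student_logits, teacher_logits):
        t_probs = softmax(t_logits)
        conf = max(t_probs)
        if conf < threshold:
            continue  # skip low-confidence pseudotargets
        target = t_probs.index(conf)
        s_probs = softmax(s_logits)
        total += -math.log(s_probs[target])
        kept += 1
    return total / kept if kept else 0.0
```

In a real training loop, `ema_update` would typically run after each student optimizer step, and the teacher's logits would come from a weakly augmented view of the same target image the student sees under stronger augmentation; that augmentation scheme is likewise an assumption here.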

Abstract

Supervised deep learning requires massive labeled datasets, but obtaining annotations is not always easy or possible, especially for dense tasks like semantic segmentation. To overcome this issue, numerous works explore Unsupervised Domain Adaptation (UDA), which uses a labeled dataset from another domain (source), or Semi-Supervised Learning (SSL), which trains on a partially labeled set. Despite the success of UDA and SSL, reaching supervised performance at a low annotation cost remains a notoriously elusive goal. To address this, we study the promising setting of Semi-Supervised Domain Adaptation (SSDA). We propose a simple SSDA framework that combines consistency regularization, pixel contrastive learning, and self-training to effectively utilize a few target-domain labels. Our method outperforms prior art on the popular GTA-to-Cityscapes benchmark and shows that as few as 50 target labels can suffice to achieve near-supervised performance. Additional results on Synthia-to-Cityscapes, GTA-to-BDD, and Synthia-to-BDD further demonstrate the effectiveness and practical utility of the method. Lastly, we find that existing UDA and SSL methods are not well-suited to the SSDA setting and discuss design patterns to adapt them.
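
The pixel contrastive term can be illustrated with a supervised contrastive (SupCon-style) loss over sampled pixel embeddings: pixels that share a class label (ground truth or pseudolabel) are pulled together and all others are pushed apart, which matches the stated goal of tightly clustering target representations. This is a hedged sketch: the function name, the temperature of 0.1, and the use of plain lists of sampled pixel features are assumptions, and the paper's exact formulation (projection head, pixel sampling, any memory bank) is not given here.

```python
import math


def normalize(v):
    """L2-normalize a feature vector (assumes it is non-zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def supervised_pixel_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss over sampled pixel embeddings (hypothetical sketch).

    embeddings: one feature vector per sampled pixel.
    labels: one class id per pixel (ground truth or pseudolabel).
    For each anchor i with at least one positive p (same label, p != i):
      loss_i = -mean_p log( exp(s_ip) / sum_{a != i} exp(s_ia) ),
    where s_ij is the cosine similarity divided by the temperature.
    Returns the mean of loss_i over anchors that have positives.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor needs at least one positive
        sims = {a: dot(z[i], z[a]) / temperature for a in range(n) if a != i}
        # Stable log-sum-exp for the denominator over all a != i.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

With embeddings that already cluster by class, the loss is close to zero; assigning the same embeddings inconsistent labels makes it large, which is the pressure toward class-wise clustering that the term is meant to provide.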

Paper Structure

This paper contains 37 sections, 9 equations, 6 figures, 18 tables, and 1 algorithm.

Figures (6)

  • Figure 1: GTA→Cityscapes results (mIoU). Our method beats all baselines in the highlighted regime of interest: SSDA with few target labels. We put forward SSDA as an alternative to UDA in which near-supervised performance can be achieved at a low annotation cost. "Supervised" denotes a model trained on the full target dataset (2975 images). Fractions denote the proportion of target-domain samples that are labeled. Results are averaged over 3 runs on a DeepLabv2 + ResNet-101 network. See the results table (tab:SSDA_sota) for exact numbers.
  • Figure 2: Framework overview. In each round, we train a student model $f_\theta$ with a combination of supervised learning $\mathcal{L}^\textrm{sup}$, consistency regularization (CR) $\mathcal{L}^\textrm{CR}$, and pixel contrastive learning $\mathcal{L}^\textrm{PC}$. A mean teacher $f_\xi$, whose gradient is stopped, generates the pseudotargets for CR. In subsequent rounds of self-training, the labeled target set also includes the pseudolabels generated in the previous round.
  • Figure 3: SSL vs. SSDA semantic segmentation results (mIoU) on GTA→Cityscapes for our method and AlonsoSSL. Using source data (SSDA) brings a substantial improvement over SSL, particularly in the low-label regime; the gap narrows as more target labels are used. All results are averaged over 3 runs on a DeepLabv2 with a ResNet-101 backbone.
  • Figure 4: Evolution of performance during self-training (self-training algorithm, alg:ST). The first self-training round ($\textbf{M}_0\rightarrow\textbf{M}_1$) brings the largest improvement, the final ensemble ($\textbf{M}_1+\textbf{M}_2$) gives the best performance, and dropping pseudolabels for fine-tuning is beneficial. Results are averaged over 3 runs for GTA→Cityscapes on a DeepLabv2 + ResNet-101 network. A tabular version is given in the self-training ablation table (tab:abl_ST).
  • Figure 5: Example of GTA images (source) stylized as Cityscapes images (target) using the LAB colorspace transformation he2021lab.
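
Figure 4 reports that ensembling the models from the last two self-training rounds performs best. One common way to fuse such models, assumed here rather than taken from the paper, is to average per-pixel class probabilities across models and take the argmax. This yields either a final prediction or, with low-confidence pixels set to an ignore index, pseudolabels for a subsequent round. The function name, the confidence threshold, and the ignore value 255 (a common Cityscapes convention) are hypothetical.

```python
def ensemble_pseudolabels(prob_maps, threshold=0.0, ignore_index=255):
    """Fuse per-pixel predictions from several models (e.g. M1 and M2).

    Hypothetical sketch: prob_maps is a list over models, each a list over
    pixels of class-probability vectors. Probabilities are averaged across
    models; the argmax is the label, unless the averaged confidence is below
    `threshold`, in which case the pixel gets `ignore_index` so a
    segmentation loss can skip it.
    """
    n_models = len(prob_maps)
    labels = []
    for pixel_probs in zip(*prob_maps):  # one tuple of vectors per pixel
        avg = [sum(c) / n_models for c in zip(*pixel_probs)]
        conf = max(avg)
        labels.append(avg.index(conf) if conf >= threshold else ignore_index)
    return labels
```

For example, if one model predicts class 0 at a pixel with probability 0.6 and the other predicts class 1 with probability 0.8, the averaged vector is [0.4, 0.6], so the ensemble assigns class 1; with `threshold=0.7` that pixel would instead be marked as ignored.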