Divide, Ensemble and Conquer: The Last Mile on Unsupervised Domain Adaptation for Semantic Segmentation

Tao Lian, Jose L. Gómez, Antonio M. López

TL;DR

DEC tackles the last mile of unsupervised domain adaptation (UDA) for semantic segmentation by leveraging synthetic multi-source data through a divide-and-conquer approach: it trains category-specific models on grouped classes and fuses their outputs with an ensemble model trained entirely on synthetic data. It is compatible with existing UDA methods and achieves state-of-the-art results on Cityscapes, BDD100K, and Mapillary Vistas, narrowing the gap to supervised learning (SL). The method groups classes into four categories, stacks the corresponding source category masks into a pseudo-image for ensemble training, and uses an EMA-updated fusion model to produce the final segmentation. Overall, DEC offers a flexible path toward parity with SL in real-world semantic segmentation while remaining compatible with current UDA pipelines.

Abstract

The last mile of unsupervised domain adaptation (UDA) for semantic segmentation is the challenge of solving the syn-to-real domain gap. Recent UDA methods have progressed significantly, yet they often rely on strategies customized for synthetic single-source datasets (e.g., GTA5), which limits their generalisation to multi-source datasets. Conversely, synthetic multi-source datasets hold promise for advancing the last mile of UDA but remain underutilized in current research. Thus, we propose DEC, a flexible UDA framework for multi-source datasets. Following a divide-and-conquer strategy, DEC simplifies the task by categorizing semantic classes, training models for each category, and fusing their outputs by an ensemble model trained exclusively on synthetic datasets to obtain the final segmentation mask. DEC can integrate with existing UDA methods, achieving state-of-the-art performance on Cityscapes, BDD100K, and Mapillary Vistas, significantly narrowing the syn-to-real domain gap.
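
The "divide" step described above, in which each category model is trained on a label that keeps only its own classes, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy 6-class label space, the two groups in `GROUPS` (the paper uses four categories), the helper name `remap_to_category`, and the choice to re-index kept classes contiguously are all assumptions made for the example.

```python
import numpy as np

# Hypothetical grouping of a toy 6-class label space into two categories.
# The paper uses four categories; the class-to-category assignment below
# is illustrative only and is not taken from the paper.
GROUPS = {
    "background": [0, 1, 2],
    "foreground": [3, 4, 5],
}
NUM_CLASSES = 6


def remap_to_category(label, class_ids):
    """Keep the classes in `class_ids` (re-indexed 0..k-1); mark all others as k."""
    other = len(class_ids)  # index used for the "other category" (assumption)
    lut = np.full(NUM_CLASSES, other, dtype=np.int64)
    for new_id, old_id in enumerate(class_ids):
        lut[old_id] = new_id
    return lut[label]  # lookup-table remapping, applied per pixel


# Toy 2x3 source label with one pixel per class.
label = np.array([[0, 3, 5],
                  [1, 2, 4]])
category_labels = {name: remap_to_category(label, ids) for name, ids in GROUPS.items()}
```

Each entry of `category_labels` has the same spatial shape as the input label. Ignore indices such as 255 are not handled in this sketch and would need an extra lookup-table entry.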

Paper Structure (23 sections, 4 equations, 9 figures, 12 tables, 2 algorithms)

Figures (9)

  • Figure 1: The overview of DEC. It consists of multiple category models and an ensemble model. Category models segment the image into category masks containing distinct classes. Subsequently, the ensemble model fuses these category masks to generate the final mask.
  • Figure 2: Training of the ensemble model. Source category models $f_{\theta^j}^S$ first generate source category masks $\hat{\mathcal{Y}}_j^S$, which are stacked into a pseudo-image $\mathcal{X}^E$ with $N_G$ channels. The student model $E_\theta$ is then trained on the source category masks and the source label, and the teacher model $E_{\theta^{'}}$ is updated with the exponential moving average of $E_\theta$ after each training step. For inference, target category models $f_{\theta^j}^T$ generate target category masks, which are then fused into a segmentation mask. Note that $f_{\theta^j}^S$ and $f_{\theta^j}^T$ are distinct models: $f_{\theta^j}^S$ is trained through SL on synthetic data, whereas $f_{\theta^j}^T$ is trained via UDA methods.
  • Figure 3: The remapping of a source semantic segmentation label $\mathcal{Y}^S$ to category labels $\{\mathcal{Y}_1^S,\mathcal{Y}_2^S,\mathcal{Y}_3^S,\mathcal{Y}_4^S\}$. For example, in the remapping to Background, only classes belonging to Background are kept; all other classes are marked as the other category (the white part in the visualisation of the Background label).
  • Figure 4: Qualitative results of DEC and the previous state-of-the-art method on Musketeers $\rightarrow$ Cityscapes (rows 1-3), Musketeers $\rightarrow$ BDD100K (rows 4-6) and Musketeers $\rightarrow$ Mapillary Vistas (rows 7-9). DEC improves foreground classes, including traffic sign, car, motorcycle, person, and bus. It also segments sidewalk better, a class that is frequently confused with road.
  • Figure 5: Qualitative results of target category models and different ensemble models on Musketeers $\rightarrow$ Cityscapes. Subfigures (b-e) display predictions from target category models trained on Musketeers, while subfigures (g-j) show predictions from ensemble models trained on GTA5, Synscapes, UrbanSyn and Musketeers.
  • ...and 4 more figures
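
The two mechanics named in the Figure 2 caption, stacking category masks into a pseudo-image with $N_G$ channels and updating the teacher with an exponential moving average of the student, can be sketched with NumPy. This is a hedged illustration rather than the paper's code: the function names, the channel-first layout, the float cast, the dictionary-of-arrays stand-in for model weights, and the smoothing factor `alpha` are assumptions.

```python
import numpy as np


def stack_category_masks(masks):
    """Stack N_G category masks of shape (H, W) into a pseudo-image (N_G, H, W)."""
    # Channel-first layout and float32 cast are assumptions for this sketch.
    return np.stack(masks, axis=0).astype(np.float32)


def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    # `alpha` is a hypothetical smoothing factor, not a value from the paper.
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


# Four toy 2x2 masks (N_G = 4) standing in for category model outputs.
masks = [np.full((2, 2), j, dtype=np.int64) for j in range(4)]
pseudo_image = stack_category_masks(masks)

# Toy parameter dictionaries standing in for the student and teacher weights.
student = {"w": np.array([1.0, 2.0])}
teacher = {"w": np.array([0.0, 0.0])}
teacher = ema_update(teacher, student, alpha=0.9)  # called after a training step
```

In a real training loop the EMA update would run once per optimisation step over all parameters of the ensemble model, so the teacher changes more smoothly than the student.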