SAM4UDASS: When SAM Meets Unsupervised Domain Adaptive Semantic Segmentation in Intelligent Vehicles

Weihao Yan, Yeqiang Qian, Xingyuan Chen, Hanyang Zhuang, Chunxiang Wang, Ming Yang

TL;DR

SAM4UDASS integrates the Segment Anything Model into unsupervised domain-adaptive semantic segmentation to refine pseudo-labels for driving scene analysis. It introduces Semantic-Guided Mask Labeling to assign semantic labels to SAM masks using source-domain area statistics and a road-based rule, addressing rare and small objects. Three fusion strategies merge SAM and UDASS labels, resolving semantic-granularity gaps and improving target-domain supervision. Across GTA5-Cityscapes, SYNTHIA-Cityscapes, and Cityscapes-ACDC, SAM4UDASS yields consistent mIoU gains and, with MIC, achieves state-of-the-art results, while remaining compatible with existing self-training methods. The work demonstrates a practical, plug-and-play approach to leverage foundation-model masks for improved domain adaptation in intelligent-vehicle perception, with future work on runtime optimization and prompt-based SAM variants.

Abstract

Semantic segmentation plays a critical role in enabling intelligent vehicles to comprehend their surrounding environments. However, deep learning-based methods usually perform poorly in domain shift scenarios due to the lack of labeled data for training. Unsupervised domain adaptation (UDA) techniques have emerged to bridge the gap across different driving scenes and enhance model performance on unlabeled target environments. Although self-training UDA methods have achieved state-of-the-art results, the challenge of generating precise pseudo-labels persists. These pseudo-labels tend to favor majority classes, consequently sacrificing the performance of rare classes or small objects like traffic lights and signs. To address this challenge, we introduce SAM4UDASS, a novel approach that incorporates the Segment Anything Model (SAM) into self-training UDA methods for refining pseudo-labels. It involves Semantic-Guided Mask Labeling, which assigns semantic labels to unlabeled SAM masks using UDA pseudo-labels. Furthermore, we devise fusion strategies aimed at mitigating semantic granularity inconsistency between SAM masks and the target domain. SAM4UDASS innovatively integrates SAM with UDA for semantic segmentation in driving scenes and seamlessly complements existing self-training UDA methodologies. Extensive experiments on synthetic-to-real and normal-to-adverse driving datasets demonstrate its effectiveness. It brings more than 3% mIoU gains on GTA5-to-Cityscapes, SYNTHIA-to-Cityscapes, and Cityscapes-to-ACDC when using DAFormer and achieves SOTA when using MIC. The code will be available at https://github.com/ywher/SAM4UDASS.
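To make the idea of labeling class-agnostic SAM masks with UDA pseudo-labels concrete, the following is a minimal sketch of the simple majority-voting baseline, not the paper's Semantic-Guided Mask Labeling itself (which, per the TL;DR, additionally uses source-domain area statistics and a road-based rule). The function name `label_masks_by_majority`, the ignore index 255, and the large-masks-first ordering are assumptions made for illustration and are not taken from the released code.

```python
import numpy as np

IGNORE_INDEX = 255  # assumed "unlabeled" value, as commonly used for Cityscapes-style labels


def label_masks_by_majority(masks, uda_label, num_classes, ignore_index=IGNORE_INDEX):
    """Give every SAM mask the most frequent UDA pseudo-label class inside it.

    masks:     list of boolean arrays of shape (H, W), one per SAM mask
    uda_label: integer array of shape (H, W) with class ids or ignore_index
    Returns an (H, W) array; pixels covered by no mask stay at ignore_index.
    """
    sam_label = np.full(uda_label.shape, ignore_index, dtype=np.int64)
    # Assumption: paint large masks first so that smaller masks overwrite them.
    order = sorted(range(len(masks)), key=lambda i: int(masks[i].sum()), reverse=True)
    for i in order:
        mask = masks[i]
        votes = uda_label[mask]
        votes = votes[votes != ignore_index]  # ignored pixels do not vote
        if votes.size == 0:
            continue  # nothing to vote with; leave the mask unlabeled
        counts = np.bincount(votes, minlength=num_classes)
        sam_label[mask] = counts.argmax()
    return sam_label


# Toy example: two masks over a noisy 4x4 pseudo-label.
uda_label = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 2],
                      [0, 1, 1, 1],
                      [0, 0, 1, 1]], dtype=np.int64)
left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True
right = np.zeros((4, 4), dtype=bool)
right[:3, 2:] = True
sam_label = label_masks_by_majority([left, right], uda_label, num_classes=3)
```

In the toy example the isolated class-1 pixel inside the left mask and the class-2 pixel inside the right mask are overwritten by the dominant class of their mask. This also shows why plain voting can erase small objects, which is the weakness that motivates a more careful labeling rule.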

Paper Structure (28 sections, 11 equations, 9 figures, 7 tables, 2 algorithms)

This paper contains 28 sections, 11 equations, 9 figures, 7 tables, 2 algorithms.

Figures (9)

  • Figure 1: Illustration of the cross-domain driving scenes faced by intelligent vehicles and the resulting semantic segmentation model performance deterioration. Left to right: source domain images, target domain images, predictions on source and target domain images using the model trained on source domain.
  • Figure 2: The demonstration of pseudo-label refinement by SAM. (a) Image with ground truth, (b) SAM masks, (c) Pseudo-label from DAFormer, (d) Pseudo-label refined by SAM4UDASS. The improvements are marked with white boxes.
  • Figure 3: Overview of SAM4UDASS. The blue arrows depict the flow of the original self-training methods, while the yellow ones represent our approach. The prediction $p_s$ for source image $x_s$ is supervised by the source label $y_s$. The UDA pseudo-label $\hat{y}_{uda}$ and the unlabeled masks $masks$ for target image $x_t$ are generated by the teacher model and SAM, respectively. The SAM pseudo-label $\hat{y}_{sam}$ is then derived through SGML. The Fusion module takes ($\hat{y}_{uda}$, $\hat{y}_{sam}$) as input and outputs the refined pseudo-label $\hat{y}_{t}$. ($x_s$, $x_t$) and ($y_s$, $\hat{y}_{t}$) are mixed using ClassMix, and the resulting ($x_m$, $y_m$) are used to train the student network.
  • Figure 4: The demonstration of Majority Voting and SGML.
  • Figure 5: The demonstration of Fusion Strategy 1 and 3.
  • ...and 4 more figures
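The last step in the Figure 3 caption, mixing ($x_s$, $x_t$) and ($y_s$, $\hat{y}_{t}$) into ($x_m$, $y_m$), can be sketched as follows. This is a simplified ClassMix-style illustration under stated assumptions, not the authors' implementation: the function name `classmix_sketch` is hypothetical, half of the classes present in the source label are selected at random, and their pixels are pasted from the source sample onto the target sample. Augmentations and ignore-index handling are omitted.

```python
import numpy as np


def classmix_sketch(x_s, y_s, x_t, y_t_hat, rng):
    """Paste the pixels of randomly chosen source classes onto the target sample.

    x_s, x_t: float arrays of shape (H, W, C), source and target images
    y_s:      integer array of shape (H, W), source ground-truth label
    y_t_hat:  integer array of shape (H, W), refined target pseudo-label
    rng:      numpy.random.Generator used to pick the classes
    Returns (x_m, y_m, paste_mask).
    """
    classes = np.unique(y_s)
    k = max(1, len(classes) // 2)  # assumption: take half of the classes present
    chosen = rng.choice(classes, size=k, replace=False)
    paste_mask = np.isin(y_s, chosen)  # True where source pixels are pasted
    x_m = np.where(paste_mask[..., None], x_s, x_t)
    y_m = np.where(paste_mask, y_s, y_t_hat)
    return x_m, y_m, paste_mask


# Toy example: the source label has four classes, the target pseudo-label is constant.
rng = np.random.default_rng(0)
x_s = np.ones((4, 4, 3))
x_t = np.zeros((4, 4, 3))
y_s = np.repeat(np.arange(4), 4).reshape(4, 4)  # one class per row
y_t_hat = np.full((4, 4), 7)
x_m, y_m, paste_mask = classmix_sketch(x_s, y_s, x_t, y_t_hat, rng)
```

Because the pasted regions carry ground-truth source labels while the remaining pixels carry the pseudo-label, the quality of $\hat{y}_{t}$ directly determines how clean the mixed supervision $y_m$ is, which is where the pseudo-label refinement matters.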