Structured prototype regularization for synthetic-to-real driving scene parsing

Jiahe Fan, Xiao Ma, Sergey Vityazev, George Giakos, Shaolong Shu, Rui Fan

Abstract

Driving scene parsing is critical for autonomous vehicles to operate reliably in complex real-world traffic environments. To reduce the reliance on costly pixel-level annotations, synthetic datasets with automatically generated labels have become a popular alternative. However, models trained on synthetic data often perform poorly when applied to real-world scenes due to the synthetic-to-real domain gap. Despite the success of unsupervised domain adaptation in narrowing this gap, most existing methods mainly focus on global feature alignment while overlooking the semantic structure of the feature space. As a result, semantic relations among classes are insufficiently modeled, limiting the model's ability to generalize. To address these challenges, this study introduces a novel unsupervised domain adaptation framework that explicitly regularizes semantic feature structures to significantly enhance driving scene parsing performance in real-world scenarios. Specifically, the proposed method enforces inter-class separation and intra-class compactness by leveraging class-specific prototypes, thereby enhancing the discriminability and structural coherence of feature clusters. An entropy-based noise filtering strategy improves the reliability of pseudo labels, while a pixel-level attention mechanism further refines feature alignment. Extensive experiments on representative benchmarks demonstrate that the proposed method consistently outperforms recent state-of-the-art methods. These results underscore the importance of preserving semantic structure for robust synthetic-to-real adaptation in driving scene parsing tasks.
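The prototype idea in the abstract can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions, not the paper's loss: prototypes are taken to be per-class mean features, compactness is measured as the mean squared distance of each pixel feature to its own class prototype, and separation is measured as the mean pairwise cosine similarity between prototypes. The function names (`class_prototypes`, `intra_class_compactness`, `inter_class_similarity`) are hypothetical.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Per-class mean feature (assumed prototype definition).

    features: (N, D) pixel features, labels: (N,) integer class ids.
    Returns the (num_classes, D) prototypes and a mask of classes present.
    """
    protos = np.zeros((num_classes, features.shape[1]))
    present = np.zeros(num_classes, dtype=bool)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(axis=0)
            present[c] = True
    return protos, present

def intra_class_compactness(features, labels, protos):
    """Mean squared distance of each feature to its own class prototype.

    Lower values mean tighter clusters.
    """
    diffs = features - protos[labels]
    return float((diffs ** 2).sum(axis=1).mean())

def inter_class_similarity(protos, present):
    """Mean pairwise cosine similarity between prototypes of present classes.

    Lower values mean better separated classes.
    """
    p = protos[present]
    if len(p) < 2:
        return 0.0
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + 1e-8)
    sim = p @ p.T
    off_diagonal = sim[~np.eye(len(p), dtype=bool)]
    return float(off_diagonal.mean())

# Toy usage: two perfectly clustered, orthogonal classes.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
labs = np.array([0, 0, 1, 1])
protos, present = class_prototypes(feats, labs, num_classes=2)
compactness = intra_class_compactness(feats, labs, protos)
similarity = inter_class_similarity(protos, present)
```

A training objective in this spirit would push the compactness term down while also pushing the similarity term down; how the paper weights and combines such terms is not stated in the abstract.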

Paper Structure (34 sections, 19 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Image domain discrepancy and the effectiveness of the UDA method in driving scene parsing. The proposed SPR enhances feature alignment by explicitly modeling and preserving inter-class separability and intra-class compactness, leading to more structured and discriminative feature representations across domains.
  • Figure 2: Overview of the proposed SPR framework. The segmentation model is trained with cross-entropy loss $\mathcal{L}_{ce}$ on the source-domain dataset and contrastive loss $\mathcal{L}_{c}$ on datasets from both domains. Prototype–prototype interactions enforce inter-class separability and intra-class compactness by structurally regularizing class prototypes. Prototype–pixel interactions align pixel-wise features with these refined prototypes, incorporating entropy-based filtering and attention mechanisms to enhance semantic consistency in the target domain.
  • Figure 3: Illustration of prototype-based structural modeling and pixel-wise uncertainty estimation. (a) Generation of the inter-class and intra-class weighted prototypes, $\boldsymbol{\tilde{P}}_{e}$ and $\boldsymbol{\tilde{P}}_{a}$, via Prototype–Prototype interaction. (b) Estimation of the entropy map $\boldsymbol{H}$ and attention weight map $\boldsymbol{W}$ through Prototype–Pixel interaction.
  • Figure 4: Feature cluster visualization in the output space using t-SNE (van der Maaten and Hinton, 2008).
  • Figure 5: Qualitative results obtained on the GTA5 $\to$ Cityscapes adaptation task. From left to right, the figure displays the input target-domain image and its corresponding ground truth, followed by the segmentation results predicted by four different methods: the source-only baseline model, AdaptSegNet (Tsai et al., 2018), SPR, and SPR with the self-training strategy.
  • ...and 6 more figures
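
The entropy map $\boldsymbol{H}$ and weight map $\boldsymbol{W}$ mentioned in the caption of Figure 3(b) can be illustrated with a short NumPy sketch. It rests on assumptions the captions do not confirm: $\boldsymbol{H}$ is taken to be the Shannon entropy of per-pixel class probabilities normalized to $[0, 1]$, pseudo labels are kept only where $\boldsymbol{H}$ falls below a hypothetical `threshold`, and $\boldsymbol{W}$ is approximated as $1 - \boldsymbol{H}$. The paper's actual attention mechanism may differ.

```python
import numpy as np

def entropy_map(probs, eps=1e-12):
    """Normalized Shannon entropy per pixel.

    probs: (H, W, C) class probabilities that sum to 1 over the last axis.
    Returns an (H, W) map in [0, 1]; 1 means a uniform (least certain) prediction.
    """
    num_classes = probs.shape[-1]
    h = -(probs * np.log(probs + eps)).sum(axis=-1)
    return h / np.log(num_classes)

def filter_pseudo_labels(probs, threshold=0.5, ignore_index=255):
    """Keep argmax pseudo labels only where normalized entropy is below threshold.

    Returns (pseudo_labels, weights, entropy). Filtered pixels receive
    ignore_index. The weights 1 - entropy are an assumed stand-in for the
    attention weight map.
    """
    h = entropy_map(probs)
    pseudo = np.where(h < threshold, probs.argmax(axis=-1), ignore_index)
    weights = 1.0 - h
    return pseudo, weights, h

# Toy usage: one confident pixel and one maximally uncertain pixel.
probs = np.array([[[0.98, 0.01, 0.01],
                   [1 / 3, 1 / 3, 1 / 3]]])
pseudo, weights, h = filter_pseudo_labels(probs)
```

In this toy case the confident pixel keeps its label and the uniform pixel is marked with `ignore_index`, so it would not contribute to a pseudo-label loss.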