NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models

Siyu Li, Fei Teng, Yihong Cao, Kailun Yang, Zhiyong Li, Yaonan Wang

TL;DR

This paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation that achieves state-of-the-art performance, together with a Perspective-Geometry Consistency Metric (PGCM) that quantitatively evaluates how well generated data can guide model learning.

Abstract

Bird's-Eye-View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, which are pivotal for real-world applications, underperform due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data and thereby make BEV segmentation more robust. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric is derived from an alignment measure between the perspective road mask of the generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at https://github.com/lynn-yu/NRSeg.
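
To make the Dirichlet branch concrete, the following is a minimal sketch of the standard evidential deep learning recipe for turning raw network outputs into class probabilities plus an uncertainty value. It is not taken from the NRSeg code: the function names, the softplus evidence mapping, and the treatment of a single pixel with mutually exclusive classes are all assumptions made for illustration (the last one sidesteps exactly the non-mutual exclusivity that HLSE is designed to handle).

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)); maps any real value to a positive one.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dirichlet_prediction(logits):
    """Hypothetical helper: raw per-class outputs -> (probabilities, uncertainty).

    Assumes the common evidential formulation: evidence e_k >= 0,
    Dirichlet parameters alpha_k = e_k + 1, strength S = sum(alpha),
    expected probability alpha_k / S, and uncertainty u = K / S.
    """
    evidence = [softplus(z) for z in logits]
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength  # lies in (0, 1]; 1 means no evidence
    return probs, uncertainty


# Strong evidence for class 0 versus almost no evidence for any class.
confident_probs, confident_u = dirichlet_prediction([10.0, -10.0, -10.0])
vague_probs, vague_u = dirichlet_prediction([-10.0, -10.0, -10.0])
print(confident_u, vague_u)
```

Under these assumptions, a pixel with little total evidence receives an uncertainty close to 1, while a pixel with strong evidence for one class receives a much lower value. A multinomial (softmax) branch run in parallel would yield probabilities only, with no comparable evidence-based uncertainty.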

Paper Structure

This paper contains 20 sections, 31 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of the proposed NRSeg framework for BEV semantic segmentation. It comprises a Perspective-Geometry Consistency Metric (PGCM) and a Bi-Distribution Parallel Prediction (BiDPP) module. In PGCM, (a) synthetic data are generated by different world models using BEV labels, bounding boxes, and text jointly; (b) reference masks are generated by back-projecting BEV semantic labels; (c) synthetic masks are produced by a general semantic segmentation model; (d) consistency scores of the synthetic data are learned from the masks. BiDPP co-learns multinomial and Dirichlet distributions: (e) the multinomial distribution is used for semantic predictions, with a consistency score derived from synthetic data noise analysis introduced into the segmentation loss; (f) the Hierarchical Local Semantic Exclusion (HLSE) module leverages evidential deep learning theory for fine-grained uncertainty quantification.
  • Figure 2: Visualization results for unsupervised domain adaptation, comparing our method against DualCross man2023dualcross. 'Ours-P' and 'Ours-D' denote the two outputs of our bi-distribution prediction. We also present the uncertainty prediction results, where darker colors indicate higher uncertainty. Our method shows stronger cross-domain adaptation ability.
  • Figure 3: Analysis of the convergence of training loss for the PGCM module. 'w' indicates that the PGCM module is used, and 'w/o' means not used.
  • Figure 4: Visualization results for semi-supervised learning. It shows the training results on 1/4 and 1/2 of the labeled data. It can be seen that the semantic results are more accurate for cases with more labeled samples.
  • Figure 5: Visualization results and consistency scores of synthetic data from PerlDiff zhang2024perldiff and MagicDrive gao2023magicdrive. (a) The first column displays the reference image from the dataset and the back-projected road mask. The other two columns present the generated synthetic data and the masks predicted by Mask2Former mask2former. (b) Consistency scores $\textbf{R}$ of the synthetic data, corresponding to the image examples from top to bottom. (c) Compared to the reference image, the synthetic data within the red bounding box exhibits a slight offset in the road structure, and the drivable area changes accordingly.
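
The captions for Figures 1 and 5 describe comparing a reference road mask (back-projected from BEV labels) with a road mask predicted on the synthetic image. As a rough illustration of how such an alignment could be scored, the sketch below uses plain intersection-over-union on binary masks. This is an assumption for illustration only: the page does not give the PGCM formula, and the function name `mask_iou` and the toy masks are hypothetical.

```python
def mask_iou(reference, synthetic):
    """Intersection-over-union of two equally sized binary masks.

    Masks are lists of rows containing 0/1 values. Returns 1.0 when both
    masks are empty, since two empty masks agree everywhere.
    """
    intersection = 0
    union = 0
    for ref_row, syn_row in zip(reference, synthetic):
        for r, s in zip(ref_row, syn_row):
            if r and s:
                intersection += 1
            if r or s:
                union += 1
    return intersection / union if union else 1.0


# Toy 2x3 masks: the "synthetic" road is shifted one pixel to the left,
# mimicking the slight structural offset described for Figure 5(c).
reference_mask = [[0, 1, 1],
                  [0, 1, 1]]
shifted_mask = [[1, 1, 0],
                [1, 1, 0]]

aligned_score = mask_iou(reference_mask, reference_mask)
shifted_score = mask_iou(reference_mask, shifted_mask)
print(aligned_score, shifted_score)
```

In this toy case the perfectly aligned pair scores 1.0 and the shifted pair scores 1/3, so a score of this kind could in principle be used to down-weight noisier synthetic samples in a training loss.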