
DiffMap: Enhancing Map Segmentation with Map Prior Using Diffusion Model

Peijin Jia, Tuopu Wen, Ziang Luo, Mengmeng Yang, Kun Jiang, Zhiquan Lei, Xuewei Tang, Ziyuan Liu, Le Cui, Bo Zhang, Long Huang, Diange Yang

TL;DR

DiffMap is a novel approach that uses a latent diffusion model to capture the structured priors of map segmentation masks. It can be seamlessly integrated into any map segmentation model, augmenting that model's ability to accurately delineate semantic information.

Abstract

Constructing high-definition (HD) maps is a crucial requirement for autonomous driving. In recent years, several map segmentation algorithms have been developed to address this need, leveraging advances in Bird's-Eye View (BEV) perception. However, existing models still struggle to produce realistic and consistent semantic map layouts. One prominent issue is the limited use of the structured priors inherent in map segmentation masks. To address this, we propose DiffMap, a novel approach specifically designed to model the structured priors of map segmentation masks using a latent diffusion model. Incorporating this technique significantly improves the performance of existing semantic segmentation methods and effectively rectifies certain structural errors in their segmentation outputs. Notably, the proposed module can be seamlessly integrated into any map segmentation model, augmenting its ability to accurately delineate semantic information. Furthermore, extensive visualization analysis shows that our model generates results that more accurately reflect real-world map layouts, further validating its efficacy in improving the quality of the generated maps.
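As background, latent diffusion models are usually trained with the standard DDPM formulation summarized below. This is the generic textbook form, shown here as an assumption for illustration; the paper's own equations, noise schedule, and loss weighting are not reproduced in this summary. A map mask is encoded into a latent $\mathbf{z}_0$, Gaussian noise is added according to a schedule $\beta_1, \dots, \beta_T$, and a network $\boldsymbol{\epsilon}_\theta$ conditioned on $\mathbf{c}$ (for DiffMap, the fused BEV features) learns to predict the added noise.

```latex
\begin{aligned}
\bar\alpha_t &= \prod_{s=1}^{t} (1-\beta_s), \\
q(\mathbf{z}_t \mid \mathbf{z}_0) &= \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\,\mathbf{z}_0,\ (1-\bar\alpha_t)\,\mathbf{I}\right)
\;\Longleftrightarrow\;
\mathbf{z}_t = \sqrt{\bar\alpha_t}\,\mathbf{z}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon},
\quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \\
\mathcal{L}_{\epsilon} &= \mathbb{E}_{\mathbf{z}_0,\, \boldsymbol{\epsilon},\, t}
\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}) \right\|_2^2, \\
\hat{\mathbf{z}}_0 &= \frac{\mathbf{z}_t - \sqrt{1-\bar\alpha_t}\,\hat{\boldsymbol{\epsilon}}}{\sqrt{\bar\alpha_t}} .
\end{aligned}
```

Because $\mathbf{z}_t$ is a known linear combination of $\mathbf{z}_0$ and $\boldsymbol{\epsilon}$, an estimate of either one determines the other (last line), which helps explain why both the noise and the clean latent can serve as prediction targets.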

Paper Structure

This paper contains 28 sections, 13 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) shows the problems of a traditional BEV map segmentation model; (b) shows the prediction of our model, which corrects these structural problems and is closer to the ground truth shown in (c).
  • Figure 2: Architecture Overview: Backbone networks extract features separately from the surrounding multi-view images and the LiDAR point clouds; these features are then transformed into Bird's-Eye View (BEV) space and fused. During training, random noise is progressively added to the ground-truth map, and in the denoising process the fused BEV features serve as the conditional control variables of DiffMap, which ultimately generates the predicted segmentation map. During inference, results are obtained by iteratively denoising from random noise.
  • Figure 3: Denoising Module: To obtain the segmentation result directly, we decouple the UNet decoder into two branches: one predicts the noise $\epsilon$, as in a traditional diffusion model, and the other predicts $\mathbf{z}$ in latent space. During denoising, the BEV features are resized to the latent-space size and used as a conditional control variable: they are first concatenated with the noisy latent map and then injected into both UNet decoders through cross-attention.
  • Figure 4: Qualitative results on short-range map segmentation.
  • Figure 5: Qualitative results on long-range map segmentation: DiffMap captures the structured prior and achieves the best segmentation results. The results show effective restoration of the parallel shapes of pedestrian crossings, smooth and continuous dividers, and shape completion for boundaries.
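The training and inference flow described in the Figure 2 and Figure 3 captions can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not DiffMap's implementation: the linear β schedule, nearest-neighbour resizing, the deterministic DDIM (η = 0) sampler, and every function name are hypothetical choices made for this example, and `denoiser` stands in for the conditional UNet. The Figure 3 caption describes a second decoder branch that predicts $\mathbf{z}$ directly; for brevity the sketch derives $\hat{\mathbf{z}}_0$ from the noise branch instead, and it omits the cross-attention path.

```python
import numpy as np


def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed DDPM default)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def q_sample(z0, t, alpha_bar, eps):
    """Training-time forward process: noise the ground-truth map latent z0 to step t."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps


def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) BEV feature map to the latent grid."""
    _, h, w = feat.shape
    rows = (np.arange(out_h) * h) // out_h
    cols = (np.arange(out_w) * w) // out_w
    return feat[:, rows[:, None], cols[None, :]]


def build_denoiser_input(z_t, bev_feat):
    """Concatenate the noisy latent map with the resized BEV condition along channels."""
    _, lh, lw = z_t.shape
    cond = resize_nearest(bev_feat, lh, lw)
    return np.concatenate([z_t, cond], axis=0)


def z0_from_eps(z_t, eps_hat, t, alpha_bar):
    """Invert z_t = sqrt(a) * z0 + sqrt(1 - a) * eps for z0, given a noise estimate."""
    a = alpha_bar[t]
    return (z_t - np.sqrt(1.0 - a) * eps_hat) / np.sqrt(a)


def ddim_sample(denoiser, bev_feat, shape, alpha_bar, steps, rng):
    """Inference: start from pure noise and denoise iteratively (DDIM with eta = 0)."""
    z = rng.standard_normal(shape)
    ts = np.linspace(len(alpha_bar) - 1, 0, steps).astype(int)
    for i, t in enumerate(ts):
        # Hypothetical noise branch of the conditional UNet.
        eps_hat = denoiser(build_denoiser_input(z, bev_feat), t)
        z0_hat = z0_from_eps(z, eps_hat, t, alpha_bar)
        if i + 1 < len(ts):
            a_prev = alpha_bar[ts[i + 1]]
            z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps_hat
        else:
            z = z0_hat
    return z


# Usage with random stand-ins for the encoded map mask and the fused BEV features.
rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar(1000)
z0_true = rng.standard_normal((4, 25, 50))
bev = rng.standard_normal((64, 200, 100))


def oracle(x, t):
    # Returns the exact noise for z0_true; a trained UNet would only approximate it.
    return (x[:4] - np.sqrt(alpha_bar[t]) * z0_true) / np.sqrt(1.0 - alpha_bar[t])


z_hat = ddim_sample(oracle, bev, z0_true.shape, alpha_bar, steps=20, rng=rng)
print(np.allclose(z_hat, z0_true))  # True
```

Because the oracle returns the exact noise, every step's $\hat{\mathbf{z}}_0$ equals `z0_true`. With a trained network the result would only approximate the ground truth, and in a latent diffusion setup the final latent would then be decoded back into a segmentation mask.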