
P-MapNet: Far-seeing Map Generator Enhanced by both SDMap and HDMap Priors

Zhou Jiang, Zhenxin Zhu, Pengfei Li, Huan-ang Gao, Tianyuan Yuan, Yongliang Shi, Hang Zhao, Hao Zhao

TL;DR

The paper tackles the challenge of online HD map generation in regions lacking HDMap infrastructure by introducing P-MapNet, which jointly exploits SDMap priors from OpenStreetMap and an HDMap prior learned by a masked autoencoder (MAE). SDMap priors are fused into BEV features via multi-head cross-attention, which copes with the misalignment between SDMaps and HDMaps, while the MAE-based HDMap prior refines the initial predictions to enforce realistic topology. On nuScenes and Argoverse2, P-MapNet yields substantial far-range gains, achieving mIoU improvements of up to 13.4% at 240 m × 60 m and up to +8.50 mAP with vectorized outputs, with HDMap priors improving perceptual realism by up to 6.34%; cross-dataset MAE pretraining indicates good generalization. The work demonstrates that combining weakly aligned SDMap skeletons with learned HDMap priors enables far-seeing HD map generation, offering practical benefits for online autonomous-driving perception and decision-making.
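The cross-attention fusion described above can be illustrated with a minimal NumPy sketch. It rests on several assumptions:

- The BEV sensor features and the SDMap prior features have already been flattened into token matrices with a shared channel width C.
- Random matrices stand in for learned projection weights.
- The function name `multi_head_cross_attention`, the residual connection and all sizes are illustrative. They are not taken from P-MapNet's released code.

Each BEV cell acts as a query and computes a weight for every SDMap token. This is what lets a weakly aligned SDMap still contribute useful structure.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(bev, sd, w_q, w_k, w_v, w_o, num_heads):
    """Hypothetical fusion step: BEV tokens (queries) attend to SDMap tokens.

    bev: (Nq, C) flattened BEV sensor features.
    sd:  (Nk, C) flattened SDMap prior features (Nk may differ from Nq).
    w_q, w_k, w_v, w_o: (C, C) projection matrices (random here, learned in practice).
    """
    nq, c = bev.shape
    nk = sd.shape[0]
    assert c % num_heads == 0, "channel width must split evenly across heads"
    d = c // num_heads
    # Project, then split channels into heads: (heads, tokens, d)
    q = (bev @ w_q).reshape(nq, num_heads, d).transpose(1, 0, 2)
    k = (sd @ w_k).reshape(nk, num_heads, d).transpose(1, 0, 2)
    v = (sd @ w_v).reshape(nk, num_heads, d).transpose(1, 0, 2)
    # Scaled dot-product attention: each BEV cell weighs all SDMap tokens
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)  # (heads, Nq, Nk)
    # Merge heads back to (Nq, C)
    out = (attn @ v).transpose(1, 0, 2).reshape(nq, c)
    # Output projection plus a residual connection onto the BEV features
    return bev + out @ w_o, attn

rng = np.random.default_rng(0)
C, heads = 32, 4
bev = rng.standard_normal((8 * 16, C))  # toy 8x16 BEV grid, flattened
sd = rng.standard_normal((4 * 8, C))    # toy downsampled 4x8 SDMap grid, flattened
ws = [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(4)]
fused, attn = multi_head_cross_attention(bev, sd, *ws, num_heads=heads)
print(fused.shape, attn.shape)  # (128, 32) (4, 128, 32)
```

The attention weights are computed per query. A BEV cell can therefore draw on an SDMap skeleton that is offset by a few cells, instead of relying on a pixel-aligned concatenation of the two maps.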

Abstract

Autonomous vehicles are gradually entering city roads today, with the help of high-definition maps (HDMaps). However, the reliance on HDMaps prevents autonomous vehicles from stepping into regions without this expensive digital infrastructure. This fact drives many researchers to study online HDMap generation algorithms, but the performance of these algorithms in far regions is still unsatisfactory. We present P-MapNet, in which the letter P highlights the fact that we focus on incorporating map priors to improve model performance. Specifically, we exploit priors in both SDMap and HDMap. On one hand, we extract weakly aligned SDMap from OpenStreetMap, and encode it as an additional conditioning branch. Despite the misalignment challenge, our attention-based architecture adaptively attends to relevant SDMap skeletons and significantly improves performance. On the other hand, we exploit a masked autoencoder to capture the prior distribution of HDMap, which can serve as a refinement module to mitigate occlusions and artifacts. We benchmark on the nuScenes and Argoverse2 datasets. Through comprehensive experiments, we show that: (1) our SDMap prior can improve online map generation performance, using both rasterized (by up to +18.73 mIoU) and vectorized (by up to +8.50 mAP) output representations; (2) our HDMap prior can improve map perceptual metrics by up to 6.34%; (3) P-MapNet can be switched into different inference modes that cover different regions of the accuracy-efficiency trade-off landscape; (4) P-MapNet is a far-seeing solution that brings larger improvements on longer ranges. Code and models are publicly available at https://jike5.github.io/P-MapNet.
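The rasterized results above are reported in mIoU. The sketch below is a hedged illustration of how a per-class IoU is averaged into mIoU, and of how a "far-seeing" gain could be isolated by distance. It computes mIoU separately within longitudinal distance bands of a BEV raster. The band split, the helper names `iou` and `miou_by_range`, and the toy masks are assumptions for illustration. The paper's exact evaluation protocol may differ.

```python
import numpy as np

def iou(pred, gt):
    """IoU of two boolean masks; NaN when both are empty."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else float("nan")

def miou_by_range(pred, gt, x_coords, bands):
    """Hypothetical range-banded mIoU.

    pred, gt: (num_classes, H, W) boolean rasters (e.g. divider, crossing, boundary).
    x_coords: (W,) signed longitudinal distance of each column from the ego vehicle (m).
    bands:    list of (lo, hi) absolute-distance intervals in metres.
    """
    results = {}
    for lo, hi in bands:
        # Keep only the raster columns whose distance falls inside this band
        cols = (np.abs(x_coords) >= lo) & (np.abs(x_coords) < hi)
        ious = [iou(p[:, cols], g[:, cols]) for p, g in zip(pred, gt)]
        results[(lo, hi)] = float(np.nanmean(ious))  # average over classes
    return results

# Toy example: one class, perfect near range, half the far-range pixels missed
x = np.linspace(-105, 105, 8)           # column centres from -105 m to +105 m
gt = np.ones((1, 2, 8), dtype=bool)
pred = gt.copy()
pred[0, 1, np.abs(x) >= 60] = False     # drop the second row in far columns
r = miou_by_range(pred, gt, x, [(0, 60), (60, 120)])
print(r)  # {(0, 60): 1.0, (60, 120): 0.5}
```

Splitting by distance separates near-range and far-range accuracy, which a single mIoU over the whole raster would blend together.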

Paper Structure

This paper contains 24 sections, 7 equations, 12 figures, and 14 tables.

Figures (12)

  • Figure 1: Left: since offline HDMap generation is cumbersome and expensive, researchers are pursuing online HDMap generation algorithms; P-MapNet is an online HDMap generator enhanced by both SDMap and HDMap priors. Right: despite the misalignment between SDMaps and HDMaps, P-MapNet significantly improves map generation performance, especially at far ranges.
  • Figure 2: P-MapNet overview. P-MapNet is designed to accept either surrounding cameras or multi-modal inputs. It processes these inputs to extract sensor features and SDMap prior features, both represented in the Bird's Eye View (BEV) space. These features are then fused using an attention mechanism and subsequently refined by the HDMap prior module to produce results that closely align with real-world map data.
  • Figure 3: Different mask strategies. "Masked" refers to the pre-training inputs after applying various masking strategies, and "Epoch-1" and "Epoch-20" denote the reconstruction results at the first and twentieth epochs of the pre-training process, respectively.
  • Figure 4: Qualitative results. We conduct a comparative analysis within a range of 240 m × 60 m on the nuScenes dataset and 120 m × 60 m on the Argoverse2 dataset, using camera and LiDAR (C+L) as input. In our notation, "S" indicates that our method uses only the SDMap priors, while "S+H" indicates that it uses both SDMap and HDMap priors. Our method consistently outperforms the baseline method under various weather conditions and in scenarios involving viewpoint occlusion.
  • Figure 5: Detailed runtime. We profile the runtime of each component in P-MapNet at a range of 60 m × 120 m on one RTX 3090 GPU.
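Figure 3 compares masking strategies for pre-training the HDMap prior. The sketch below shows only one common choice in masked autoencoders: random masking of non-overlapping square patches of a rasterized map. Whether it matches any strategy in the figure is an assumption. The function name `random_patch_mask`, the patch size and the mask ratio are hypothetical.

During pre-training, a network would reconstruct the original raster from the masked one. The learned prior can then serve as a refinement module for an initial prediction.

```python
import numpy as np

def random_patch_mask(raster, patch, mask_ratio, rng):
    """Zero out a random subset of non-overlapping patch x patch tiles.

    raster: (C, H, W) rasterized HD map; H and W must be divisible by patch.
    Returns the masked raster and a boolean (H // patch, W // patch) tile grid
    marking which tiles were masked.
    """
    c, h, w = raster.shape
    assert h % patch == 0 and w % patch == 0, "raster must tile evenly"
    gh, gw = h // patch, w // patch
    n = gh * gw
    num_masked = int(round(mask_ratio * n))
    # Choose distinct tiles to hide
    masked_ids = rng.choice(n, size=num_masked, replace=False)
    grid = np.zeros(n, dtype=bool)
    grid[masked_ids] = True
    grid = grid.reshape(gh, gw)
    # Upsample the tile grid to pixel resolution
    pixel_mask = grid.repeat(patch, axis=0).repeat(patch, axis=1)
    masked = raster.copy()
    masked[:, pixel_mask] = 0  # apply the same spatial mask to every channel
    return masked, grid

rng = np.random.default_rng(0)
raster = np.ones((3, 8, 16))  # toy 3-class raster
masked, grid = random_patch_mask(raster, patch=4, mask_ratio=0.5, rng=rng)
print(grid.sum(), masked.mean())  # 4 0.5
```

Masking whole tiles, rather than scattered pixels, forces the model to infer missing lane structure from context. This mirrors the occlusions that the HDMap prior is meant to repair.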