PriorMapNet: Enhancing Online Vectorized HD Map Construction with Priors

Rongxuan Wang, Xin Lu, Xiaoyang Liu, Xiaoyi Zou, Tongyi Cao, Ying Li

TL;DR

PriorMapNet addresses unstable matching in online vectorized HD map construction by injecting priors into both the decoder and encoder. It introduces the PPS-Decoder with position and structure priors derived from offline clustering, the PF-Encoder that provides BEV feature priors, and the DMD cross-attention to improve efficiency. The approach achieves state-of-the-art results on nuScenes and Argoverse 2, demonstrates robustness across enlarged BEV ranges, and maintains practical inference speed. Limitations include the absence of semantic priors and reliance on single-frame inputs, pointing to future work on temporal integration and richer priors.

Abstract

Online vectorized High-Definition (HD) map construction is crucial for downstream prediction and planning tasks in autonomous driving. Following the MapTR paradigm, recent works have made noteworthy progress. However, mainstream methods initialize reference points randomly, which leads to unstable matching between predictions and ground truth. To address this issue, we introduce PriorMapNet, which enhances online vectorized HD map construction with priors. We propose the PPS-Decoder, which provides reference points with position and structure priors. Fitted from the map elements in the dataset, the prior reference points lower the learning difficulty and achieve stable matching. Furthermore, we propose the PF-Encoder to enhance the image-to-BEV transformation with BEV feature priors. In addition, we propose the DMD cross-attention, which decouples cross-attention along the multi-scale and multi-sample dimensions, respectively, to improve efficiency. PriorMapNet achieves state-of-the-art performance in the online vectorized HD map construction task on the nuScenes and Argoverse 2 datasets. The code will be released publicly soon.

Paper Structure (20 sections, 7 equations, 9 figures, 10 tables)


Figures (9)

  • Figure 1: Comparison of the unstable matching scores (lower is better). (a) and (b) show the unstable matching scores during validation and training, respectively. $u$ is the percentage of queries whose GT match changed compared with the previous decoder layer, and $u_t$ is the percentage of final output queries whose GT match changed compared with the first decoder layer. "Queries with Priors" denotes the queries corresponding to Prior Reference Points.
  • Figure 2: Comparison of the matching of MapTRv2 and our proposed method. Reference points with position and structure priors achieve stable matching.
  • Figure 3: Overview of our proposed PriorMapNet. Given multi-view images as input, the output is a set of map elements. PriorMapNet consists of three modules: the backbone, the PF-Encoder, and the PPS-Decoder. The backbone extracts image features using a ResNet and an FPN neck. The PF-Encoder transforms image features into BEV features and downsamples them to multiple scales. The PPS-Decoder predicts map elements with a Transformer, using reference points with priors for stable matching. In the cross-attention layer, the DMD cross-attention is used to improve efficiency.
  • Figure 4: Comparison of the decoders of MapTRv2, MGMap, and our proposed PriorMapNet. For simplicity, we only show the first layer of the Transformer decoder. (a) MapTRv2 uses randomly initialized learnable query positions for all layers without any adaptation, which leads to unstable matching. (b) MGMap adds Mask-Activated Instance to provide semantic priors but lacks position information. In contrast, (c) PriorMapNet enhances reference points with priors, achieving stable matching.
  • Figure 5: Comparison of the vanilla MSDA and our proposed DMD cross-attention. DMD cross-attention performs cross-attention along multi-scale and multi-sample respectively to achieve efficiency.
  • ...and 4 more figures
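
The unstable matching scores defined in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' evaluation code. It assumes a hypothetical array `matches` in which `matches[l, q]` holds the index of the GT element matched to query `q` at decoder layer `l` (with `-1` for unmatched queries), and it assumes both scores are computed over all queries. The function name `unstable_matching_scores` is likewise an illustrative choice.

```python
import numpy as np


def unstable_matching_scores(matches):
    """Compute per-layer u and overall u_t as percentages.

    matches: integer array of shape (num_layers, num_queries).
             matches[l, q] is the GT index matched to query q at
             decoder layer l, or -1 if unmatched (assumed layout).
    Returns (u, u_t): u has one entry per layer from the second
    layer onward; u_t compares the last layer with the first.
    """
    matches = np.asarray(matches)
    if matches.ndim != 2 or matches.shape[0] < 2:
        raise ValueError("expected shape (num_layers >= 2, num_queries)")

    # u: share of queries whose match differs from the previous layer.
    changed_vs_previous = matches[1:] != matches[:-1]
    u = 100.0 * changed_vs_previous.mean(axis=1)

    # u_t: share of queries whose final match differs from the first layer.
    u_t = 100.0 * float((matches[-1] != matches[0]).mean())
    return u, u_t


# Toy example: 3 decoder layers, 4 queries (made-up assignments).
example = [
    [0, 1, -1, 2],
    [0, -1, 1, 2],
    [0, -1, 1, 2],
]
u, u_t = unstable_matching_scores(example)
```

In this toy example two of the four queries switch their match between the first and second layers and none switch afterwards, so a method with stable matching would show low values for both scores.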