Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks

Sehwan Choi, Jungho Kim, Hongjae Shin, Jun Won Choi

TL;DR

Mask2Map tackles online vectorized HD map construction for autonomous driving by predicting the class and ordered point set of each map instance in BEV with a two-network pipeline. IMPNet generates Mask-Aware Queries and BEV Segmentation Masks; MMPNet then refines these queries with instance-level positional information from the Positional Query Generator (PQG) and point-level geometric features from the Geometric Feature Extractor (GFE). An Inter-network Denoising Training strategy mitigates the inconsistency in GT matching between the two networks. On nuScenes and Argoverse2, Mask2Map improves over prior state-of-the-art methods by 10.1% mAP and 4.1 mAP, respectively. These results suggest that combining global semantic context with local geometric detail yields higher-quality vectorized HD maps.

Abstract

In this paper, we introduce Mask2Map, a novel end-to-end online HD map construction method designed for autonomous driving applications. Our approach focuses on predicting the class and ordered point set of map instances within a scene, represented in the bird's eye view (BEV). Mask2Map consists of two primary components: the Instance-Level Mask Prediction Network (IMPNet) and the Mask-Driven Map Prediction Network (MMPNet). IMPNet generates Mask-Aware Queries and BEV Segmentation Masks to capture comprehensive semantic information globally. Subsequently, MMPNet enhances these query features using local contextual information through two submodules: the Positional Query Generator (PQG) and the Geometric Feature Extractor (GFE). PQG extracts instance-level positional queries by embedding BEV positional information into Mask-Aware Queries, while GFE utilizes BEV Segmentation Masks to generate point-level geometric features. However, we observed limited performance in Mask2Map due to inter-network inconsistency, which arises because IMPNet and MMPNet can match their predictions to Ground Truth (GT) differently. To tackle this challenge, we propose the Inter-network Denoising Training method, which guides the model to denoise the output affected by both noisy GT queries and perturbed GT Segmentation Masks. Our evaluation on the nuScenes and Argoverse2 benchmarks demonstrates that Mask2Map achieves remarkable performance improvements over previous state-of-the-art methods, with gains of 10.1% mAP and 4.1 mAP, respectively. Our code can be found at https://github.com/SehwanChoi0307/Mask2Map.
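
The abstract says PQG embeds BEV positional information into Mask-Aware Queries, but it does not give the formulation. As a rough illustration only, the sketch below assumes one plausible reading: pool normalized BEV coordinates with each instance's soft mask, then add a linear embedding of that position to the query. The function names, the sigmoid-weighted pooling, and the projection matrix `proj` are all hypothetical and are not taken from the Mask2Map code.

```python
import numpy as np

def mask_pooled_position(mask_logits: np.ndarray) -> np.ndarray:
    """Return one (x, y) BEV position per instance, in [0, 1] coordinates.

    mask_logits: (N, H, W) per-instance BEV mask logits (assumed input format).
    """
    n, h, w = mask_logits.shape
    probs = 1.0 / (1.0 + np.exp(-mask_logits))            # sigmoid -> soft masks
    weights = probs / (probs.sum(axis=(1, 2), keepdims=True) + 1e-6)
    # Normalized pixel-center coordinates of the BEV grid.
    ys, xs = np.meshgrid(
        (np.arange(h) + 0.5) / h, (np.arange(w) + 0.5) / w, indexing="ij"
    )
    grid = np.stack([xs, ys], axis=-1)                     # (H, W, 2)
    # Mask-weighted average of the grid coordinates for each instance.
    return np.einsum("nhw,hwc->nc", weights, grid)         # (N, 2)

def positional_queries(queries: np.ndarray,
                       mask_logits: np.ndarray,
                       proj: np.ndarray) -> np.ndarray:
    """Add a linear embedding of the pooled position to each query.

    queries: (N, C), proj: (2, C). Both are hypothetical stand-ins.
    """
    pos = mask_pooled_position(mask_logits)                # (N, 2)
    return queries + pos @ proj                            # (N, C)
```

In a trained model the projection would be learned and the pooled quantity would likely be a richer positional encoding than raw (x, y); the point here is only how a mask can localize a query.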

Paper Structure (20 sections, 5 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Comparison of several online HD map construction methods: (a) Segmentation-based decoding, (b) detection-based decoding, (c) learnable query-based decoding, (d) proposed Mask2Map. Our Mask2Map utilizes Mask-Aware Queries to capture global-scale semantic information about a scene, enabling the generation of vectorized HD map components through subsequent query decoding.
  • Figure 2: Overall structure of Mask2Map. The Mask2Map system consists of IMPNet and MMPNet. IMPNet generates Mask-Aware Queries and BEV Segmentation Masks using Multi-scale BEV Features extracted from sensor data. Then, MMPNet predicts the class and ordered point set of map instances using PQG, GFE, and Mask-Guided Map Decoder. Both PQG and GFE generate semantic geometrical features on the map instances, and the Mask-Guided Map Decoder constructs vectorized maps based on these features.
  • Figure 3: Illustration of proposed Map Noise method. (a) The blue polygon denotes a vectorized GT of a pedestrian crossing. (b) The pink polygon represents a GT Segmentation Mask without noise. (c) The red polygon represents the result of adding Map Noise to the GT Segmentation Mask.
  • Figure 4: Qualitative results on the nuScenes validation set. We compared our method with MapTRv2. The regions marked by a red ellipse and rectangle emphasize the superior results generated by our proposed model.
  • Figure 5: Qualitative results under different scenarios on the Argoverse2 validation set. We compared our method with MapTRv2. Both methods utilized multi-view camera images as input and employed ResNet50 as a backbone.
  • ...and 4 more figures
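
The Figure 3 caption describes a vectorized GT polygon, its clean GT Segmentation Mask, and a mask with Map Noise added. The exact noise model is not given on this page, so the following is a minimal sketch under an assumed one: shift the whole polygon by a random offset, jitter each vertex, and rasterize the result. The helper names and the Gaussian noise parameters are hypothetical.

```python
import numpy as np

def rasterize_polygon(vertices: np.ndarray, height: int, width: int) -> np.ndarray:
    """Even-odd fill of a polygon given as a (K, 2) array of (x, y) in pixel units."""
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5                   # pixel centers
    inside = np.zeros((height, width), dtype=bool)
    x0, y0 = vertices[:, 0], vertices[:, 1]
    x1, y1 = np.roll(x0, -1), np.roll(y0, -1)     # next vertex, wrapping around
    for ax, ay, bx, by in zip(x0, y0, x1, y1):
        if ay == by:
            continue                              # horizontal edges never cross the ray
        crosses = (ay > py) != (by > py)          # edge spans this pixel row
        x_at = ax + (py - ay) * (bx - ax) / (by - ay)
        inside ^= crosses & (px < x_at)           # toggle on each crossing to the right
    return inside

def add_map_noise(vertices: np.ndarray,
                  rng: np.random.Generator,
                  shift_std: float = 1.0,
                  jitter_std: float = 0.5) -> np.ndarray:
    """Assumed noise model: one global shift plus independent per-vertex jitter."""
    shift = rng.normal(0.0, shift_std, size=(1, 2))
    jitter = rng.normal(0.0, jitter_std, size=vertices.shape)
    return vertices + shift + jitter

# Usage: a square "pedestrian crossing" on an 8x8 BEV grid.
square = np.array([[2.0, 2.0], [6.0, 2.0], [6.0, 6.0], [2.0, 6.0]])
rng = np.random.default_rng(0)
clean_mask = rasterize_polygon(square, 8, 8)
noisy_mask = rasterize_polygon(add_map_noise(square, rng), 8, 8)
```

During denoising training, a perturbed mask like `noisy_mask` would be paired with its known GT instance, so the downstream network is supervised to recover the clean map element without relying on a separate matching step.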