AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction

Thomas Monninger, Md Zafar Anwar, Stanislaw Antol, Steffen Staab, Sihao Ding

TL;DR

AugMapNet tackles online vectorized HD map construction by enriching the latent BEV grid with dense spatial cues from a raster map while decoding vectorized map elements. It introduces latent BEV grid augmentation with gradient stopping to treat the raster-derived prior as immutable, and adds BEV processing CNN blocks to induce a more structured latent space. Empirical results on nuScenes and Argoverse2 show significant vector map gains, including strong improvements at longer perception ranges and successful transfer to the SQD-MapNet baseline, with latent-space analyses (PCA and mutual information) indicating closer alignment to ground-truth rasters. These findings demonstrate that integrating dense spatial supervision into BEV latent spaces can meaningfully enhance vectorized HD map construction for robust autonomous driving systems.
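The effect of the gradient stop can be illustrated with a toy scalar example. This is only a sketch under stated assumptions, not the AugMapNet implementation: the `Dual` class, the `augment` function, the scalar stand-ins for the encoder and raster decoder, and the additive merge of the raster-derived prior are all hypothetical choices made for illustration. It uses forward-mode differentiation with dual numbers to show that, once the prior is wrapped in a gradient stop, the derivative of the augmented feature with respect to the encoder parameter flows only through the direct encoder path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value together with its derivative w.r.t. one chosen parameter."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def stop_gradient(x):
    """Keep the value but drop the derivative, making x act like a constant."""
    return Dual(x.val, 0.0)


def augment(theta, use_stop):
    """Toy pipeline; every coefficient and the additive merge are assumptions."""
    b_enc = 2.0 * theta        # stand-in for the BEV encoder output
    m_raster = 3.0 * b_enc     # stand-in for the predicted raster map
    prior = stop_gradient(m_raster) if use_stop else m_raster
    return b_enc + 0.5 * prior  # assumed merge of prior into the latent grid


theta = Dual(1.0, 1.0)  # seed derivative: d(theta)/d(theta) = 1
with_stop = augment(theta, use_stop=True)
without_stop = augment(theta, use_stop=False)
print(with_stop.grad, without_stop.grad)
```

Both variants produce the same forward value; only the derivative differs, which is the sense in which the raster-derived prior is treated as immutable by the downstream path.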

Abstract

Autonomous driving requires understanding infrastructure elements, such as lanes and crosswalks. To navigate safely, this understanding must be derived from sensor data in real time and needs to be represented in vectorized form. Learned Bird's-Eye View (BEV) encoders are commonly used to combine a set of camera images from multiple views into one joint latent BEV grid. Traditionally, an intermediate raster map is predicted from this latent space, providing dense spatial supervision but requiring post-processing into the desired vectorized form. More recent models directly derive infrastructure elements as polylines using vectorized map decoders, providing instance-level information. Our approach, Augmentation Map Network (AugMapNet), proposes latent BEV feature grid augmentation, a novel technique that significantly enhances the latent BEV representation. AugMapNet combines vector decoding and dense spatial supervision more effectively than existing architectures while remaining easy to integrate compared to other hybrid approaches. It additionally benefits from extra processing on its latent BEV features. Experiments on the nuScenes and Argoverse2 datasets demonstrate significant improvements in vectorized map prediction of up to 13.3% over the StreamMapNet baseline at 60 m range and greater improvements at larger ranges. We confirm transferability by applying our method to another baseline, SQD-MapNet, and find similar improvements. A detailed analysis of the latent BEV grid confirms that AugMapNet's latent space is more structured and shows the value of our novel concept beyond pure performance improvement. The code is available at https://github.com/tmonnin/augmapnet.

Paper Structure

This paper contains 33 sections, 9 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Schematic relation between output element and latent BEV grid. Rasterized decoding (left) and vectorized decoding (right) provide dense and sparse spatial supervision, respectively.
  • Figure 2: Overview of AugMapNet architecture. Camera images $U$ are processed by learned BEV encoder $e_{\mathrm{BEV}}$ into latent BEV grid $B_{\mathrm{enc}}$. AugMapNet employs a latent BEV grid augmentation mechanism that generates $\hat{\mathcal{M}}_{\mathrm{raster}}$ (i.e., map semantic segmentation). Additional CNNs help structure the latent space. A vector map decoder $d_{\mathrm{vector}}$ predicts vectorized map $\hat{\mathcal{M}}_{\mathrm{vector}}$. Dashed lines indicate gradient stop.
  • Figure 3: Qualitative result on nuScenes dataset. Input camera images ($U$) are at the top. Ground truth labels ($\mathcal{M}_{\mathrm{vector}}$) and predicted vectorized maps ($\hat{\mathcal{M}}_{\mathrm{vector}}$) are at the bottom. The orange circle highlights the crosswalk missed by StreamMapNet.
  • Figure 4: Rendering of GT $\mathcal{M}_{\mathrm{raster}}$ and $\mathcal{M}_{\mathrm{vector}}$, prediction $\hat{\mathcal{M}}_{\mathrm{vector}}$, and the top 3 principal components of the latent BEV grid input to $d_{\mathrm{vector}}$ for StreamMapNet and AugMapNet (PC1-3). Orange circles highlight elements missed by StreamMapNet.
  • Figure 5: Vectorized map prediction performance versus (a) Mutual Information between top 3 principal components and ground truth raster map and (b) Variance of latent BEV grid for all nuScenes val data points. (StreamMapNet: blue, AugMapNet: red)
  • ...and 7 more figures
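
The mutual information mentioned in the Figure 5 caption can be estimated in several ways, and the estimator used by the authors is not stated here. As a minimal sketch, assuming a simple plug-in (histogram) estimator, the snippet below quantizes the values of one principal component per BEV cell into bins and computes the mutual information with per-cell raster class labels. The function names `quantize` and `mutual_information`, the bin count, and the tiny example data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def quantize(values, n_bins):
    """Map continuous values to equal-width bin indices in [0, n_bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


# Hypothetical data: one principal-component value and one raster label
# per flattened BEV cell.
pc1 = [0.0, 0.1, 0.9, 1.0]
raster_labels = [0, 0, 1, 1]

mi_aligned = mutual_information(quantize(pc1, 2), raster_labels)
mi_unrelated = mutual_information(quantize(pc1, 2), [0, 1, 0, 1])
print(mi_aligned, mi_unrelated)
```

In this toy case the component that separates the two classes perfectly reaches the maximum of ln 2 nats for a balanced binary label, while the unrelated labelling yields zero. Plug-in estimates are biased upward for small samples and many bins, so the bin count matters on real data.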