Table of Contents
Fetching ...

Unveiling the Hidden: Online Vectorized HD Map Construction with Clip-Level Token Interaction and Propagation

Nayeon Kim, Hongje Seong, Daehyun Ji, Sujin Jang

TL;DR

A novel paradigm of clip-level vectorized HD map construction, MapUnveiler, which explicitly unveils the occluded map elements within a clip input by relating dense image representations with efficient clip tokens, effectively utilizing long-term temporal map information.

Abstract

Predicting and constructing road geometric information (e.g., lane lines, road markers) is a crucial task for safe autonomous driving, while such static map elements can be repeatedly occluded by various dynamic objects on the road. Recent studies have shown significantly improved vectorized high-definition (HD) map construction performance, but there has been insufficient investigation of temporal information across adjacent input frames (i.e., clips), which may lead to inconsistent and suboptimal prediction results. To tackle this, we introduce a novel paradigm of clip-level vectorized HD map construction, MapUnveiler, which explicitly unveils the occluded map elements within a clip input by relating dense image representations with efficient clip tokens. Additionally, MapUnveiler associates inter-clip information through clip token propagation, effectively utilizing long-term temporal map information. MapUnveiler runs efficiently with the proposed clip-level pipeline by avoiding redundant computation with temporal stride while building a global map relationship. Our extensive experiments demonstrate that MapUnveiler achieves state-of-the-art performance on both the nuScenes and Argoverse2 benchmark datasets. We also showcase that MapUnveiler significantly outperforms state-of-the-art approaches in a challenging setting, achieving +10.7% mAP improvement in heavily occluded driving road scenes. The project page can be found at https://mapunveiler.github.io.

Unveiling the Hidden: Online Vectorized HD Map Construction with Clip-Level Token Interaction and Propagation

TL;DR

A novel paradigm of clip-level vectorized HD map construction, MapUnveiler, which explicitly unveils the occluded map elements within a clip input by relating dense image representations with efficient clip tokens, effectively utilizing long-term temporal map information.

Abstract

Predicting and constructing road geometric information (e.g., lane lines, road markers) is a crucial task for safe autonomous driving, while such static map elements can be repeatedly occluded by various dynamic objects on the road. Recent studies have shown significantly improved vectorized high-definition (HD) map construction performance, but there has been insufficient investigation of temporal information across adjacent input frames (i.e., clips), which may lead to inconsistent and suboptimal prediction results. To tackle this, we introduce a novel paradigm of clip-level vectorized HD map construction, MapUnveiler, which explicitly unveils the occluded map elements within a clip input by relating dense image representations with efficient clip tokens. Additionally, MapUnveiler associates inter-clip information through clip token propagation, effectively utilizing long-term temporal map information. MapUnveiler runs efficiently with the proposed clip-level pipeline by avoiding redundant computation with temporal stride while building a global map relationship. Our extensive experiments demonstrate that MapUnveiler achieves state-of-the-art performance on both the nuScenes and Argoverse2 benchmark datasets. We also showcase that MapUnveiler significantly outperforms state-of-the-art approaches in a challenging setting, achieving +10.7% mAP improvement in heavily occluded driving road scenes. The project page can be found at https://mapunveiler.github.io.

Paper Structure

This paper contains 43 sections, 7 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: (a) Existing approaches relying on single-frame inference cannot capture the entire map information in the BEV features liu2023vectormapnetliao2023maptrliao2023maptrv2. (b) Recent alternatives explore temporal information via streaming, but they cannot address the inherent nature of maps and propagate noise from previous timestamps. (c) We directly unveil hidden maps in BEV features by interacting with clip tokens that contain high-level map information. We visualize BEV features by 1D PCA projection. The BEV features are extracted from (a) MapTRv2 liao2023maptrv2, (b) StreamMapNet yuan2024streammapnet, and (c) our MapUnveiler.
  • Figure 2: Our framework takes clip-level multi-view images and outputs clip-level vectorized HD maps. All components in the frame-level MapNet (i.e., Backbone, PV to BEV, Map Decoder) are adopted from MapTRv2 liao2023maptrv2. The frame-level MapNet extracts map queries and BEV features independently at each frame. MapUnveiler generates compact clip tokens that contain clip-level temporal map information and directly interact with dense BEV features. With the updated BEV features, we construct high-quality clip-level vectorized maps. The generated map tokens and clip tokens are then written to memory.
  • Figure 3: A detailed implementation of Intra-clip Unveiler. We use blue, red, and green arrows to indicate the flows of map tokens, BEV features, and clip tokens, respectively. In each attention and feed forward layer, standard layer normalization, dropout, and residual connections are followed.
  • Figure 4: Qualitative comparisons on two range variants of nuScenes val set: 60$\times$30$m$ and 100$\times$50$m$. We compare our MapUnveiler with MapTRv2 liao2023maptrv2 and StreamMapNet yuan2024streammapnet. We marked significant improvements from MapTRv2 and StreamMapNet using green boxes.
  • Figure 5: A limitation under heavy occlusions. MapUnveiler initially roughly predicts the boundary where we highlighted in green. However, the invisible regions are continuous, and MapUnveiler eventually predicts the region as having no boundary.
  • ...and 4 more figures