MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction

Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, Xinggang Wang

TL;DR

MapTRv2 introduces an end-to-end framework for online vectorized HD map construction that models map elements as permutation-equivalent point sets, enabling stable learning for arbitrarily shaped elements. It leverages a hierarchical query embedding and decoupled self-attention within a Transformer encoder-decoder to efficiently predict both undirected and directed map elements, supported by hierarchical bipartite matching and auxiliary dense supervision to accelerate convergence. The approach achieves real-time performance and state-of-the-art accuracy on nuScenes and Argoverse2, and extends to centerline learning and 3D map reconstruction. These advances offer a practical, scalable module for autonomous driving pipelines and downstream planning tasks.

Abstract

High-definition (HD) maps provide abundant and precise static environmental information about the driving scene and serve as a fundamental and indispensable component for planning in autonomous driving systems. In this paper, we present Map TRansformer, an end-to-end framework for online vectorized HD map construction. We propose a unified permutation-equivalent modeling approach, i.e., modeling each map element as a point set with a group of equivalent permutations, which accurately describes the shape of the map element and stabilizes the learning process. We design a hierarchical query embedding scheme to flexibly encode structured map information and perform hierarchical bipartite matching for map element learning. To speed up convergence, we further introduce auxiliary one-to-many matching and dense supervision. The proposed method copes well with map elements of arbitrary shape. It runs at real-time inference speed and achieves state-of-the-art performance on both the nuScenes and Argoverse2 datasets. Abundant qualitative results show stable and robust map construction quality in complex and varied driving scenes. Code and more demos are available at https://github.com/hustvl/MapTR to facilitate further studies and applications.
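
The permutation-equivalent modeling described in the abstract can be illustrated with a minimal sketch. It is not taken from the MapTR code base: the function names `equivalent_permutations` and `apply_permutation` and the `kind` labels are hypothetical, and the sketch assumes a polygon is stored as N distinct vertices with no repeated closing point. It enumerates the index orderings that would be treated as equivalent for a directed polyline, an undirected polyline, and a polygon.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def equivalent_permutations(num_points: int, kind: str) -> List[List[int]]:
    """Enumerate index orderings treated as equivalent for one map element.

    `kind` is a hypothetical label used only in this sketch:
      - "directed_polyline": one ordering (e.g. a centerline)
      - "polyline": two orderings (forward and reversed)
      - "polygon": every start point in both traversal directions
    """
    forward = list(range(num_points))
    if kind == "directed_polyline":
        return [forward]
    if kind == "polyline":
        return [forward, forward[::-1]]
    if kind == "polygon":
        perms = []
        for order in (forward, forward[::-1]):
            # Each cyclic shift corresponds to a different start point.
            for shift in range(num_points):
                perms.append(order[shift:] + order[:shift])
        return perms
    raise ValueError(f"unknown element kind: {kind}")


def apply_permutation(points: List[Point], perm: List[int]) -> List[Point]:
    """Reorder a point set according to one permutation."""
    return [points[i] for i in perm]
```

Under these assumptions, a 4-point undirected polyline has 2 equivalent orderings and a 4-vertex polygon has 8, while a directed element keeps only its given order.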

Paper Structure

This paper contains 19 sections, 17 equations, 10 figures, 17 tables.

Figures (10)

  • Figure 1: Speed-accuracy trade-off comparisons. The proposed MapTRv2 outperforms previous state-of-the-art methods in terms of both speed (FPS) and accuracy (mAP). Compared with MapTR, MapTRv2 further improves performance by a large margin. FPS is measured on a single NVIDIA RTX 3090.
  • Figure 2: Illustration of permutation-equivalent shape modeling. Map elements are geometrically abstracted and discretized into polylines and polygons. MapTRv2 models each map element with $(V, \Gamma)$ (a point set $V$ and a group of equivalent permutations $\Gamma$), avoiding ambiguity and stabilizing the learning process. As a special case, if a polyline element has a specific direction (e.g., a centerline), $\Gamma$ includes only one permutation.
  • Figure 3: Typical cases illustrating the ambiguity of map elements in terms of start point and direction. Left: for a polyline with no specific direction (e.g., the lane divider between two opposite lanes), defining its direction is difficult. Either endpoint of the lane divider can be regarded as the start point, and the point set can be organized in two directions. Right: for a polygon (e.g., a pedestrian crossing), each point of the polygon can be regarded as the start point, and the polygon can be connected in two opposite directions (counter-clockwise and clockwise). Note that some kinds of map elements (such as centerlines) have a specific direction and have no ambiguity issue.
  • Figure 4: The overall architecture of MapTRv2. MapTRv2 adopts an encoder-decoder paradigm. The map encoder transforms sensor input to a unified BEV representation. The map decoder adopts a hierarchical query embedding scheme to explicitly encode map elements. The $L$ stacked Transformer decoder layers iteratively refine the predicted map elements. We propose several self-attention variants and cross-attention variants to efficiently update the query features. MapTRv2 is fully end-to-end. The pipeline is highly structured, compact and efficient.
  • Figure 5: Hierarchical bipartite matching. MapTRv2 performs instance-level matching to find the optimal instance-level assignment $\hat{\pi}$, and performs point-level matching to find the optimal point-to-point assignment $\hat{\gamma}$ (Sec. \ref{sec:one2one_matching}). Based on the optimal instance-level and point-level assignments ($\hat{\pi}$ and $\{\hat{\gamma}_i\}$), the one-to-one set prediction loss (Sec. \ref{sec:one2one_loss}) is defined for end-to-end learning.
  • ...and 5 more figures
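
The two-level matching summarized in the Figure 5 caption can be sketched in a few lines. This is an illustrative toy, not the authors' implementation. It makes three assumptions: the point-level cost is a plain L1 distance, the instance-level cost uses only that position term (any classification term is left out), and the instance assignment is found by brute force rather than a Hungarian solver. The function names are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple

Point = Tuple[float, float]


def l1(a: Point, b: Point) -> float:
    """L1 distance between two 2D points (assumed cost for this sketch)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def point_level_match(pred: List[Point], gt: List[Point],
                      gammas: List[List[int]]) -> Tuple[float, List[int]]:
    """Pick the equivalent permutation of `gt` that best fits `pred`."""
    best_cost, best_gamma = float("inf"), gammas[0]
    for gamma in gammas:
        cost = sum(l1(p, gt[j]) for p, j in zip(pred, gamma))
        if cost < best_cost:
            best_cost, best_gamma = cost, gamma
    return best_cost, best_gamma


def hierarchical_match(preds: List[List[Point]], gts: List[List[Point]],
                       gammas_per_gt: List[List[List[int]]]
                       ) -> List[Tuple[int, int, List[int]]]:
    """Return (pred_index, gt_index, gamma_hat) for every ground-truth element.

    Brute-force search over assignments; only practical for tiny inputs.
    Assumes len(preds) >= len(gts).
    """
    # table[i][j] = (cost, best permutation) for prediction i vs ground truth j
    table = [[point_level_match(p, g, gam)
              for g, gam in zip(gts, gammas_per_gt)] for p in preds]
    best_total, best_pi = float("inf"), None
    for pi in permutations(range(len(preds)), len(gts)):
        total = sum(table[pi[j]][j][0] for j in range(len(gts)))
        if total < best_total:
            best_total, best_pi = total, pi
    return [(best_pi[j], j, table[best_pi[j]][j][1]) for j in range(len(gts))]


# Toy scene: one undirected divider and one directed centerline.
gts = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
       [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)]]
gammas_per_gt = [[[0, 1, 2], [2, 1, 0]],  # undirected: both directions allowed
                 [[0, 1, 2]]]             # directed: a single permutation
preds = [[(0.0, 5.1), (1.0, 5.0), (2.0, 5.0)],
         [(2.0, 0.0), (1.0, 0.0), (0.0, 0.1)],   # divider predicted in reverse
         [(10.0, 10.0), (11.0, 10.0), (12.0, 10.0)]]
matches = hierarchical_match(preds, gts, gammas_per_gt)
# matches == [(1, 0, [2, 1, 0]), (0, 1, [0, 1, 2])]
```

In this toy scene the reversed divider prediction is matched to the undirected ground truth at low cost because the reversed ordering is among its equivalent permutations. The third prediction is left unmatched.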