Neural HD Map Generation from Multiple Vectorized Tiles Locally Produced by Autonomous Vehicles

Miao Fan, Yi Yao, Jianping Zhang, Xiangbo Song, Daihui Wu

TL;DR

This work tackles the challenge of building a globally consistent HD map from multiple vectorized tiles produced by autonomous vehicles, a task for which prior online methods yield only local, ego-centric maps. It introduces GNMap, an end-to-end framework built on a shared multi-layer, attention-based autoencoder trained in two phases: pretraining to complete masked tiles, and finetuning to assign the correct category to map elements across tiles from multiple tours. GNMap surpasses the current SOTA method by more than 5% in F1 and a Gaussian Mixture Model baseline by over 10%, while remaining suitable for industrial deployment. The method is deployed at Navinfo, where it automatically constructs HD maps for autonomous driving in Mainland China and reduces manual intervention.

Abstract

High-definition (HD) maps are a fundamental component of autonomous driving systems, as they provide precise environmental information about driving scenes. Recent work on vectorized map generation can produce merely 65% of the local map elements around the ego-vehicle at runtime from a single tour with onboard sensors, leaving open the question of how to construct a global HD map projected in the world coordinate system to high-quality standards. To address this issue, we present GNMap, an end-to-end generative neural network that automatically constructs HD maps from multiple vectorized tiles produced locally by autonomous vehicles over several tours. It leverages a multi-layer, attention-based autoencoder as the shared network, whose parameters are learned from two different tasks (pretraining and finetuning) to ensure both the completeness of the generated maps and the correctness of element categories. Extensive qualitative evaluations are conducted on a real-world dataset, and experimental results show that GNMap surpasses the SOTA method by more than 5% in F1 score, reaching the level of industrial usage with a small amount of manual modification. We have already deployed it at Navinfo Co., Ltd., where it serves as indispensable software for automatically building HD maps for autonomous driving systems.

Paper Structure (24 sections, 15 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Illustration of a snapshot of a vectorized HD map. It is composed of static map elements, such as pedestrian crossings, lane dividers, and road boundaries, which are geometrically discretized into polylines or polygons.
  • Figure 2: The architecture of the shared autoencoder employed by GNMap. It is a multi-layer generative neural network mainly composed of multi-head self-attention functions.
  • Figure 3: Illustration of the data processing pipeline at the pretraining phase, where the shared autoencoder is responsible for completing masked (gray-scaled) vectorized tiles.
  • Figure 4: Illustration of the data processing pipeline at the finetuning phase, where the pretrained parameters are leveraged as the initial weights of the shared autoencoder. It aims to assign each pixel of the map elements to the correct category.
  • Figure 5: An example of how to calculate $Precision$ (abbr. as $P$) and $Recall$ (abbr. as $R$). In this case, we have three map elements (two lane dividers and a road boundary). For the lane dividers (colored green), there are 7 predicted points/pixels and 4 ground-truth points/pixels. 3 of the 7 are accepted because they lie within 0.5 m of the ground-truth pixels. Therefore, $P_{div.} = 3/7$ and $R_{div.} = 3/4$. For the road boundary (colored yellow), there are 8 predicted points/pixels and 5 ground-truth points/pixels. 4 of the 8 are accepted because they lie within 0.5 m of the ground-truth pixels. Therefore, $P_{bou.} = 4/8$ and $R_{bou.} = 4/5$.
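
The counting described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the point coordinates, and the use of Euclidean distance with an inclusive 0.5 m threshold are all assumptions made here. It follows the caption literally, so the number of accepted predicted points is divided by the number of predicted points for $P$ and by the number of ground-truth points for $R$.

```python
import math

def precision_recall(pred, gt, tol=0.5):
    """Point-level precision and recall for one map-element category.

    A predicted point is accepted if it lies within `tol` metres
    (Euclidean distance, inclusive; an assumption) of any ground-truth point.
    """
    accepted = sum(
        1 for p in pred
        if any(math.dist(p, g) <= tol for g in gt)
    )
    precision = accepted / len(pred) if pred else 0.0
    recall = accepted / len(gt) if gt else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Hypothetical coordinates (in metres), chosen only to reproduce the
# lane-divider counts in the caption: 7 predicted, 4 ground truth, 3 accepted.
gt_div = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
pred_div = [
    (0.1, 0.0), (0.2, 1.0), (0.3, 2.0),              # within 0.5 m: accepted
    (2.0, 0.0), (2.0, 1.0), (2.0, 2.0), (2.0, 3.0),  # 2 m away: rejected
]

p_div, r_div = precision_recall(pred_div, gt_div)
f1_div = f1_score(p_div, r_div)
print(p_div, r_div, f1_div)
```

With these inputs the lane-divider values match the caption ($P_{div.} = 3/7$, $R_{div.} = 3/4$), and the corresponding F1 is $6/11$. Because recall here counts accepted predicted points rather than matched ground-truth points, it could exceed 1 if several predictions cluster around one ground-truth point; whether the paper guards against that is not stated in this summary.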