A Vision-Centric Approach for Static Map Element Annotation

Jiaxin Zhang, Shiyuan Chen, Haoran Yin, Ruohong Mei, Xuan Liu, Cong Yang, Qian Zhang, Wei Sui

TL;DR

This work presents CAMA, a vision-centric approach for Consistent and Accurate Map Annotation that achieves high reprojection accuracy across all surrounding cameras and is spatio-temporally consistent across the whole sequence.

Abstract

The recent development of online static map element (a.k.a. HD Map) construction algorithms has created strong demand for data with ground-truth annotations. However, currently available public datasets cannot provide training data of sufficient quality in terms of consistency and accuracy. To this end, we present CAMA: a vision-centric approach for Consistent and Accurate Map Annotation. Without LiDAR inputs, our proposed framework can still generate high-quality 3D annotations of static map elements. Specifically, the annotations achieve high reprojection accuracy across all surrounding cameras and are spatio-temporally consistent across the whole sequence. We apply our proposed framework to the popular nuScenes dataset to provide efficient and highly accurate annotations. Compared with models trained on the original nuScenes static map elements, models trained with annotations from CAMA achieve lower reprojection errors (e.g., 4.73 vs. 8.03 pixels).
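
The abstract reports reprojection error in pixels. As a rough illustration of what such a number measures, the sketch below projects annotated 3D points into one camera and averages the pixel distance to the observed image locations. It assumes an ideal pinhole camera without lens distortion and a known world-to-camera pose; the function names, parameters, and numeric values are hypothetical and are not taken from the paper or from the nuScenes devkit, and the paper's exact metric definition may differ.

```python
import math


def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    Assumption: `rotation` (3x3 nested lists) and `translation` (length 3)
    map world coordinates into the camera frame, p_cam = R * p_world + t.
    Returns None when the point lies on or behind the image plane.
    """
    x, y, z = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(points_world, pixels_observed,
                            rotation, translation, fx, fy, cx, cy):
    """Mean Euclidean pixel distance between projected and observed points.

    Points that fall behind the camera are skipped.
    """
    errors = []
    for point, observed in zip(points_world, pixels_observed):
        projected = project_point(point, rotation, translation, fx, fy, cx, cy)
        if projected is None:
            continue
        errors.append(math.hypot(projected[0] - observed[0],
                                 projected[1] - observed[1]))
    if not errors:
        raise ValueError("no points are visible in this camera")
    return sum(errors) / len(errors)


# Illustrative values only: identity pose and made-up intrinsics.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(1.0, 0.0, 10.0), (0.0, 1.0, 20.0)]
observations = [(903.0, 454.0), (800.0, 500.0)]
error = mean_reprojection_error(points, observations, IDENTITY,
                                [0.0, 0.0, 0.0], 1000.0, 1000.0, 800.0, 450.0)
print(f"mean reprojection error: {error:.2f} px")
```

In a surround-camera setting, the same computation would be repeated per camera and per frame with that camera's own pose and intrinsics, which is why an annotation that is accurate in one view but inconsistent across the sequence still produces a large aggregate error.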

Paper Structure (10 sections, 5 figures, 2 tables)

This paper contains 10 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Reprojection consistency and accuracy comparison. The top and bottom rows show HD map reprojection images and zoom-in details for nuScenes and our proposed method, respectively. The yellow dots represent road teeth, and the white dots represent lane dividers. The nuScenes HD Map contains elements that are inconsistent with the actual road environment, including false positive (FP) and false negative (FN) road element annotations. For example, the image shows no lane marking between the bicycle lane and the vehicle lane, but the HD Map indicates a lane divider (a, b). The image shows a pedestrian-crossing marking, but the HD Map has no marking in the corresponding area (e, f). In contrast, the HD Map from our proposed method shows better reprojection accuracy (c, d) and consistency.
  • Figure 2: Illustration of our proposed reconstruction and annotation pipeline. The surround images and auxiliary sensor data are fed into our proposed odometry-guided SfM to obtain highly accurate ego vehicle poses and sparse 3D points. A road surface mesh reconstruction method called RoMe is then applied to build dense 3D road surfaces with semantic labels. Finally, a vectorized map annotation (VMA) system is applied to produce the 3D HD map that the perception algorithm requires as training data.
  • Figure 3: Sparse 3D lane point clouds from the OpenLane V1 dataset, which uses LiDAR point cloud projection to generate 3D lane annotations. The red, blue, and yellow points refer to road teeth, lane dividers, and solid yellow lane marks, respectively. The side view shows that the elevation noise is at the meter level. The zoom-in view also shows that the noise is distributed in all directions.
  • Figure 4: Reconstructed HD Map of scene-0828 from nuScenes using our proposed method. (a) Semantic map in BEV; purple, pink, and white correspond to road surface, road teeth, and lane marking, respectively. (b) Photometric map in BEV. (c) Elevation visualized as a heatmap; brighter indicates higher.
  • Figure :