CAMAv2: A Vision-Centric Approach for Static Map Element Annotation

Shiyuan Chen, Jiaxin Zhang, Ruohong Mei, Yingfeng Cai, Haoran Yin, Tao Chen, Wei Sui, Cong Yang

TL;DR

CAMAv2 introduces a vision-centric pipeline that generates accurate, reprojection-consistent 3D HD map annotations from surround-view imagery without LiDAR. It fuses WIGO-based pose estimation, an odometry-guided SfM with multiple efficiency/robustness enhancements, and RoMe road-surface meshes, followed by a semi-automatic VMA for 3D map annotation with elevation. On nuScenes, CAMAv2 reduces semantic reprojection error from 8.03 to 4.96 pixels and improves MapTRv2's reprojection performance when trained with CAMAv2 data, while a multi-scene aggregation and parallel reconstruction approach delivers fivefold efficiency gains and better robustness. The approach generalizes to other datasets such as Waymo Open Dataset, supports long-tail and adverse-weather scenarios, and provides publicly available code and nuScenes-CAMAv2 annotations to accelerate 4D labeling for autonomous driving research.

Abstract

The recent development of online static map element (a.k.a. HD map) construction algorithms has created a vast demand for data with ground-truth annotations. However, currently available public datasets cannot provide high-quality training data in terms of consistency and accuracy. For instance, the manually labelled (and therefore inefficient to produce) nuScenes dataset still contains misalignment and inconsistency between the HD maps and the images (e.g., a reprojection error of around 8.03 pixels on average). To this end, we present CAMAv2: a vision-centric approach for Consistent and Accurate Map Annotation. Even without LiDAR inputs, our proposed framework generates high-quality 3D annotations of static map elements. Specifically, the annotations achieve high reprojection accuracy across all surrounding cameras and are spatially and temporally consistent across the whole sequence. We apply our proposed framework to the popular nuScenes dataset to provide efficient and highly accurate annotations. Compared with the original nuScenes static map element annotations, our CAMAv2 annotations achieve lower reprojection errors (e.g., 4.96 vs. 8.03 pixels). Models trained with annotations from CAMAv2 also achieve lower reprojection errors (e.g., 5.62 vs. 8.43 pixels).
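To make the reprojection figures quoted above concrete, here is a minimal sketch of how a pixel reprojection error can be computed with a standard pinhole camera model. It is not the paper's evaluation code: the exact metric definition is not given in this section, and the function names `project_point` and `mean_reprojection_error`, the intrinsics, and the sample points are all hypothetical.

```python
import math


def project_point(point_world, rotation, translation, intrinsics):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    Assumes p_cam = R * p_world + t, with R a 3x3 nested list and t a
    length-3 sequence. intrinsics is (fx, fy, cx, cy). Returns None when
    the point lies behind the camera.
    """
    x, y, z = (
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if z <= 0.0:
        return None
    fx, fy, cx, cy = intrinsics
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(points_world, pixels_observed, rotation,
                            translation, intrinsics):
    """Mean Euclidean pixel distance between projected and observed points."""
    errors = []
    for point, observed in zip(points_world, pixels_observed):
        projected = project_point(point, rotation, translation, intrinsics)
        if projected is None:
            continue  # points behind the camera cannot be compared
        errors.append(math.hypot(projected[0] - observed[0],
                                 projected[1] - observed[1]))
    return sum(errors) / len(errors) if errors else float("nan")


# Hypothetical example: identity pose and made-up intrinsics.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ZERO_T = [0.0, 0.0, 0.0]
INTRINSICS = (1000.0, 1000.0, 800.0, 450.0)

map_points = [(1.0, 0.0, 10.0), (0.0, 1.0, 20.0)]   # 3D map element vertices
image_points = [(903.0, 454.0), (800.0, 500.0)]      # matching 2D labels
example_error = mean_reprojection_error(
    map_points, image_points, IDENTITY, ZERO_T, INTRINSICS)
print(example_error)
```

In this sketch the depth `z` of each map vertex enters the projection directly, which is why a map without elevation information can reproject several pixels away from the image evidence.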

Paper Structure (21 sections, 6 equations, 13 figures, 6 tables, 2 algorithms)

Figures (13)

  • Figure 1: Comparison of reprojection consistency and accuracy. The top and bottom rows present the HD map reprojection of the original nuScenes (semi-automatic) and our CAMAv2 (pure-automatic) annotations. The yellow and white lines represent road boundaries and lane dividers, respectively. The original nuScenes road element annotations are inconsistent with the actual environment. For instance, the bikeway in images (a) and (b) has no lane marking, yet one is wrongly indicated on the HD maps (a false positive). Conversely, images (a) and (b) show a lane divider and a road boundary that have no HD map marking in the corresponding area (a false negative). Due to the lack of elevation information, the reprojected road elements are misaligned with the image (c). In contrast, the HD map from our proposed method shows better reprojection accuracy (f) and consistency (d, e). Best viewed in colour.
  • Figure 2: Illustration of our proposed reconstruction and annotation pipeline. The surround-view images and auxiliary sensor data are fed into our proposed odometry-guided SfM to obtain highly accurate ego-vehicle poses and sparse 3D points. A road surface mesh reconstruction method, RoMe (road surface reconstruction via mesh representations), is then applied to build dense 3D road surfaces with semantic labels. Finally, a vectorized map annotation (VMA) system produces the 3D HD map that perception algorithms require as training data.
  • Figure 3: WIGO algorithm. GNSS, IMU, and wheel odometry measurements are fused in a pose graph optimization to obtain accurate global poses.
  • Figure 4: Illustration of homography-guided spatial pairs (HSP). The potential matching image pairs can be filtered by computing the visual overlap between different cameras, paying more attention to the road surface.
  • Figure 5: Using our proposed method, we reconstruct an HD map of a $300\,\mathrm{m} \times 300\,\mathrm{m}$ site covering multiple scenes in nuScenes. (a) Semantic map in BEV; purple, pink, and white correspond to road surface, curb, and lane marking, respectively. (b) Photometric map in BEV. (c) Elevation visualized as a heatmap; brighter indicates higher.
  • ...and 8 more figures
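
The pair filtering described in the Figure 4 caption (keeping only image pairs whose views overlap on the road surface) can be illustrated with a simplified stand-in. The sketch below is not the authors' HSP implementation: instead of an explicit homography it samples points on an assumed flat ground plane, projects them into each camera, and scores a pair by the intersection-over-union of the visible ground points. All names (`yaw_camera`, `road_overlap`, `select_pairs`), the shared optical centre, the camera height, and the intrinsics are assumptions made for illustration.

```python
import math
from itertools import combinations


def yaw_camera(yaw_deg):
    """World-to-camera rotation for a camera yawed about the vertical axis.

    Axis convention (assumed): x right, y down, z forward.
    """
    a = math.radians(yaw_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]


def sees(rotation, point, intrinsics, size, min_depth=0.5):
    """True if a world point projects inside the image of this camera."""
    x, y, z = (sum(rotation[r][k] * point[k] for k in range(3))
               for r in range(3))
    if z < min_depth:
        return False
    fx, fy, cx, cy = intrinsics
    u, v = fx * x / z + cx, fy * y / z + cy
    width, height = size
    return 0.0 <= u < width and 0.0 <= v < height


def road_overlap(rot_a, rot_b, intrinsics, size,
                 cam_height=1.5, extent=30.0, step=1.0):
    """IoU of the ground-plane grid points visible in two cameras."""
    both = either = 0
    n = int(extent / step)
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            # Ground plane sits cam_height below the shared optical centre.
            point = (i * step, cam_height, j * step)
            in_a = sees(rot_a, point, intrinsics, size)
            in_b = sees(rot_b, point, intrinsics, size)
            both += in_a and in_b
            either += in_a or in_b
    return both / either if either else 0.0


def select_pairs(rotations, intrinsics, size, min_overlap=0.1):
    """Keep camera index pairs whose road-surface overlap is large enough."""
    return [(i, j) for i, j in combinations(range(len(rotations)), 2)
            if road_overlap(rotations[i], rotations[j],
                            intrinsics, size) >= min_overlap]


# Hypothetical rig: front camera, camera yawed by 30 degrees, rear camera.
INTRINSICS = (1000.0, 1000.0, 800.0, 450.0)
IMAGE_SIZE = (1600, 900)
rig = [yaw_camera(0.0), yaw_camera(30.0), yaw_camera(180.0)]
candidate_pairs = select_pairs(rig, INTRINSICS, IMAGE_SIZE)
print(candidate_pairs)
```

Restricting the score to ground-plane samples is what focuses the filter on the road surface; a fuller version would also account for the translation between cameras and between frames.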