VRSO: Visual-Centric Reconstruction for Static Object Annotation

Chenyao Yu, Yingfeng Cai, Jiaxin Zhang, Hui Kong, Wei Sui, Cong Yang

TL;DR

VRSO offers low cost, high efficiency, and high quality: (1) it recovers static objects in 3D space using only camera images as input, and (2) it requires almost no manual annotation, because ground truth (GT) for static object detection (SOD) tasks is generated by an automatic reconstruction and annotation pipeline.

Abstract

As part of the perception output of intelligent driving systems, static object detection (SOD) in 3D space provides crucial cues for understanding the driving environment. With the rapid deployment of deep neural networks for SOD tasks, the demand for high-quality training samples is soaring. The traditional and reliable approach is manual labelling over dense LiDAR point clouds and reference images. Although most public driving datasets adopt this strategy to provide SOD ground truth (GT), it remains expensive and time-consuming in practice. This paper introduces VRSO, a visual-centric approach for static object annotation. Experiments on the Waymo Open Dataset show that the mean reprojection error of VRSO annotations is only 2.6 pixels, roughly a quarter of the error of the Waymo Open Dataset labels (10.6 pixels). VRSO offers low cost, high efficiency, and high quality: (1) it recovers static objects in 3D space using only camera images as input, and (2) it requires almost no manual annotation, because GT for SOD tasks is generated by an automatic reconstruction and annotation pipeline.
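The reprojection error quoted in the abstract can be illustrated with a minimal sketch. It assumes a simple pinhole camera model and measures the mean Euclidean pixel distance between projected 3D points and their 2D observations; the paper's exact metric definition is not given here, and the function names, intrinsics, and sample points below are hypothetical.

```python
import numpy as np

def project_points(points_cam, K):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates (pinhole model)."""
    points_cam = np.asarray(points_cam, dtype=float)
    uvw = points_cam @ K.T            # (N, 3) homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]   # divide by depth

def mean_reprojection_error(points_cam, observed_px, K):
    """Mean Euclidean distance, in pixels, between projected and observed points."""
    projected = project_points(points_cam, K)
    diffs = projected - np.asarray(observed_px, dtype=float)
    return float(np.mean(np.linalg.norm(diffs, axis=1)))

# Hypothetical intrinsics and data, for illustration only.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
corners_cam = np.array([[1.0, 0.5, 10.0],
                        [-1.0, 0.5, 10.0]])   # 3D points in the camera frame (metres)
observed = np.array([[743.0, 414.0],
                     [540.0, 410.0]])         # matching 2D observations (pixels)

error = mean_reprojection_error(corners_cam, observed, K)
print(f"mean reprojection error: {error:.2f} px")
```

In this toy case the first point is off by 5 pixels and the second matches exactly, so the mean is 2.5 pixels. A lower value means the 3D annotation lines up more closely with what the camera actually sees.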

Paper Structure (21 sections, 1 equation, 8 figures, 2 tables)


Figures (8)

  • Figure 1: Comparison between annotations from our proposed VRSO (green) and from Waymo (red) after reprojection from 3D space to 2D images. All images are from the Waymo Open Dataset (WOD). Reprojection errors (false positives and false negatives, marked by yellow arrows) are easy to find among the WOD annotations. For instance, traffic signs are (1) missed in (a), (b), (d) and (e), (2) incorrectly labelled in (c), (e) and (h), and (3) not tightly covered in (a), (b), (f) and (g). In contrast, VRSO yields consistent and accurate annotations across all images, even at low resolution (e) and under poor illumination (c) and (d). Best viewed in colour.
  • Figure 2: The traditional annotation pipeline [mei2022waymo] and our VRSO annotation pipeline. Our proposed VRSO has lower cost and higher efficiency.
  • Figure 3: Our proposed VRSO consists of six main steps: (1) Given an image sequence, Structure-from-Motion (SfM) recovers accurate ego poses and sparse 3D key points. The 2D-3D correspondences of these key points provide association information across frames. (2) Multi-Object Tracking and Segmentation (MOTS) detects and segments static objects in 2D. (3) Initial 3D annotations are proposed through an iterative 2D/3D association of the above 3D points and 2D boxes. (4) These proposals are further refined by splitting or merging at the instance level. (5) The 3D parameters (position, orientation, size) are further optimized by minimizing the reprojection error, so that the reprojections stay consistent. (6) Static objects are annotated automatically based on the optimized 3D parameters and tracking information (yellow is VRSO, red is Waymo GT). Best viewed in colour.
  • Figure 4: Visualization of annotations. Green and red boxes are the annotations from VRSO and WOD, respectively. Annotations from our proposed VRSO fit the targets more accurately than those from WOD, even at night under poor illumination.
  • Figure 5: Converting clip-wise VRSO annotation into frame-wise annotation. $T^{C}_{i}$ and $T^{L}_{i}$ denote the camera and LiDAR coordinates at time $t$, respectively, with $t = 0, \dots, N$.
  • ...and 3 more figures
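
The clip-wise to frame-wise conversion named in the Figure 5 caption can be sketched as a change of coordinates. The sketch below assumes that each per-frame pose (for example $T^{L}_{i}$) is a 4x4 homogeneous sensor-to-world transform and that clip-level annotations live in a shared world frame; the caption does not state these conventions, so they, along with the helper names and sample values, are assumptions for illustration only.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 sensor-to-world pose from a yaw angle (rad) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def clip_to_frame(points_world, T_sensor_to_world):
    """Express Nx3 clip-level (world) points in one frame's sensor coordinates."""
    pts = np.asarray(points_world, dtype=float)
    homo = np.hstack([pts, np.ones((len(pts), 1))])    # (N, 4) homogeneous points
    T_world_to_sensor = np.linalg.inv(T_sensor_to_world)
    return (homo @ T_world_to_sensor.T)[:, :3]

# Hypothetical data: one static object centre annotated once for the whole clip,
# and three LiDAR poses as the vehicle moves 1 m forward per frame.
sign_world = np.array([[10.0, 2.0, 1.5]])
lidar_poses = [make_pose(0.0, [float(i), 0.0, 0.0]) for i in range(3)]

# One frame-wise annotation per pose, all derived from the single clip-level one.
per_frame = [clip_to_frame(sign_world, T) for T in lidar_poses]
for i, p in enumerate(per_frame):
    print(f"frame {i}: {p[0]}")
```

Because the object is static, a single clip-level annotation is enough: only the sensor pose changes from frame to frame, so each frame-wise label is the same world point seen from a different pose.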