LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

Wei-Chih Hung, Vincent Casser, Henrik Kretzschmar, Jyh-Jing Hwang, Dragomir Anguelov

Abstract

The 3D Average Precision (3D AP) relies on the intersection over union between predictions and ground truth objects. However, camera-only detectors have limited depth accuracy, which may cause otherwise reasonable predictions that suffer from such longitudinal localization errors to be treated as false positives. We therefore propose variants of the 3D AP metric to be more permissive with respect to depth estimation errors. Specifically, our novel longitudinal error tolerant metrics, LET-3D-AP and LET-3D-APL, allow longitudinal localization errors of the prediction boxes up to a given tolerance. To evaluate the proposed metrics, we also construct a new test set for the Waymo Open Dataset, tailored to camera-only 3D detection methods. Surprisingly, we find that state-of-the-art camera-based detectors can outperform popular LiDAR-based detectors with our new metrics past a 10% depth error tolerance, suggesting that existing camera-based detectors already have the potential to surpass LiDAR-based detectors in downstream applications. We believe the proposed metrics and the new benchmark dataset will facilitate advances in the field of camera-only 3D detection by providing more informative signals that can better indicate the system-level performance.

Paper Structure (18 sections, 11 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Evaluating camera-only 3D detections with the 3D Average Precision (3D AP) metric (left) and with the proposed longitudinal error tolerant LET-3D-AP(L) metric (right). The figures depict the bipartite matching between the detections (green) and the ground truth objects (black). Regular 3D AP matching (left) is based on intersection over union (IoU) values and cannot match detections that suffer from longitudinal localization errors, even though such detections are reasonable and can provide useful signals to downstream modules. In contrast, the proposed LET-3D-AP(L) (right) is more permissive: it shifts the predictions to mitigate the longitudinal localization errors. The shifted predictions, shown in blue, are used to compute the longitudinal error tolerant intersection over union (LET-IoU). To account for the longitudinal tolerance used, we propose longitudinal affinity (LA) as a measure of how close the original prediction is to the ground truth in the longitudinal direction.
  • Figure 2: Breakdown of the localization error. We decompose the 3D detection localization error into a lateral error and a longitudinal error. We find that the longitudinal error is more prominent in camera-only 3D detection. We therefore propose longitudinal error tolerant (LET) metrics that are more permissive with respect to the longitudinal localization error.
  • Figure 3: Computing LET-IoU. Given a predicted object and the ground truth object it is to be matched with, we move the predicted object along the line of sight to the point that minimizes its distance to the ground truth center. We then compute the LET-IoU as the 3D IoU between the aligned predicted object and the ground truth object.
  • Figure 4: An example of a matched detection using LET-IoU. The green box denotes the detection, the red box denotes the ground truth, and the blue box denotes the longitudinally aligned detection as stated in Section \ref{subsec:let_iou}. We also show the connections between matched prediction boxes and aligned boxes using a purple connector.
  • Figure 5: CDFs of Box Shifts. We visualize the shift between box and camera_synced_box based on camera types (left) and select object types (right) on the validation set.
  • ...and 1 more figures
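
The alignment step in the Figure 3 caption and the error breakdown in the Figure 2 caption can be illustrated with a small geometric sketch. This is not the authors' implementation. It assumes the sensor sits at the origin of the coordinate frame, works on box centers only, and uses hypothetical function names (`align_along_line_of_sight`, `decompose_error`). The prediction center is slid along the line through the origin and itself to the point closest to the ground truth center. The shift applied is then taken as the longitudinal error and the remaining offset as the lateral error.

```python
import math


def align_along_line_of_sight(pred_center, gt_center):
    """Slide pred_center along its line of sight to the point nearest gt_center.

    Assumption: the sensor is at the origin, so the line of sight is the
    line through the origin and pred_center. The result is the orthogonal
    projection of gt_center onto that line.
    """
    norm = math.sqrt(sum(p * p for p in pred_center))
    if norm == 0.0:
        raise ValueError("prediction center coincides with the sensor origin")
    unit = [p / norm for p in pred_center]
    # Signed range of the projected ground truth center along the line of sight.
    scale = sum(g * u for g, u in zip(gt_center, unit))
    return [scale * u for u in unit]


def decompose_error(pred_center, gt_center):
    """Return (longitudinal_error, lateral_error) as non-negative distances."""
    aligned = align_along_line_of_sight(pred_center, gt_center)
    longitudinal = math.dist(aligned, pred_center)  # shift along the line of sight
    lateral = math.dist(aligned, gt_center)         # residual after alignment
    return longitudinal, lateral


# Illustrative values only: a prediction 10 m ahead, ground truth at (12, 1, 0).
aligned = align_along_line_of_sight([10.0, 0.0, 0.0], [12.0, 1.0, 0.0])
lon_err, lat_err = decompose_error([10.0, 0.0, 0.0], [12.0, 1.0, 0.0])
```

In this illustrative case the aligned center is (12, 0, 0), so the prediction is shifted 2 m along the line of sight and a 1 m lateral offset remains. A full LET-IoU computation would then take the 3D IoU between the shifted box and the ground truth box, which this sketch omits.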