Better Monocular 3D Detectors with LiDAR from the Past

Yurong You, Cheng Perng Phoo, Carlos Andres Diaz-Ruiz, Katie Z Luo, Wei-Lun Chao, Mark Campbell, Bharath Hariharan, Kilian Q Weinberger

TL;DR

This work addresses the fundamental depth-ambiguity challenge in monocular 3D detection by leveraging unlabeled LiDAR data from past traversals through AsyncDepth. The method creates asynchronous depth features by densifying past LiDAR point clouds, projecting them into the current camera view to form depth maps, and learning depth-aware representations that are fused with current image features in an end-to-end framework. Across Lyft L5 and Ithaca365, AsyncDepth consistently improves two representative monocular detectors, FCOS3D and Lift-Splat (LSS), with low latency and modest storage costs, achieving gains of up to 9.5 mAP in far-range detection. The results demonstrate practical potential for community-based LiDAR data sharing to upgrade camera-only perception, enabling cheaper autonomous systems without sacrificing performance.
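The projection step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `render_depth_map`, the argument layout, and the convention that the camera looks along +z are all assumptions, and the densification of past point clouds is omitted. It projects 3D points through a pinhole camera and keeps the nearest depth at each pixel.

```python
import numpy as np


def render_depth_map(points_world, world_to_cam, K, height, width):
    """Render a sparse depth map from 3D points (hypothetical helper).

    points_world: (N, 3) points from a past traversal, in a shared world frame.
    world_to_cam: (4, 4) rigid transform from the world frame to the current camera.
    K:            (3, 3) pinhole intrinsics.
    Returns an (height, width) array; pixels hit by no point are 0.
    """
    # Move the points into the current camera frame using homogeneous coordinates.
    ones = np.ones((len(points_world), 1))
    pts_h = np.concatenate([points_world, ones], axis=1)
    pts_cam = (world_to_cam @ pts_h.T).T[:, :3]

    # Discard points behind the camera (assumed to look along +z).
    z = pts_cam[:, 2]
    in_front = z > 1e-6
    pts_cam, z = pts_cam[in_front], z[in_front]

    # Perspective projection to integer pixel coordinates.
    uvw = (K @ pts_cam.T).T
    u = np.floor(uvw[:, 0] / z).astype(int)
    v = np.floor(uvw[:, 1] / z).astype(int)

    # Keep only the points that land inside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # When several points hit the same pixel, keep the closest one.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0
    return depth
```

Under this sketch, each past traversal would yield one such depth map for the current camera pose, and those maps would then be passed to a learned feature extractor before fusion with the image features.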

Abstract

Accurate 3D object detection is crucial to autonomous driving. Though LiDAR-based detectors have achieved impressive performance, the high cost of LiDAR sensors precludes their widespread adoption in affordable vehicles. Camera-based detectors are cheaper alternatives but often suffer from inferior performance compared to their LiDAR-based counterparts due to inherent depth ambiguities in images. In this work, we seek to improve monocular 3D detectors by leveraging unlabeled historical LiDAR data. Specifically, at inference time, we assume that the camera-based detectors have access to multiple unlabeled LiDAR scans from past traversals at locations of interest (potentially from other high-end vehicles equipped with LiDAR sensors). Under this setup, we propose a novel, simple, and end-to-end trainable framework, termed AsyncDepth, to effectively extract relevant features from asynchronous LiDAR traversals of the same location for monocular 3D detectors. We show consistent and significant performance gains (up to 9 AP) across multiple state-of-the-art models and datasets with a negligible additional latency of 9.66 ms and a small storage cost.

Paper Structure (16 sections, 4 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: Can past LiDAR traversals help monocular 3D object detection? Here we show a current image (left) and an asynchronous depth map rendered from a past LiDAR traversal (right). The asynchronous depth map provides accurate depth for background regions (red arrows) and helps the monocular model disambiguate foreground objects in the current scene (blue arrows).
  • Figure 2: Overview of AsyncDepth. It consists of three parts: (top left) general "featurize-then-detect" pipeline for monocular 3D detection; (bottom) extracting asynchronous depth features from past LiDAR traversals of the same location; (top right) fusing the image features with AsyncDepth features. Please refer to the method section for symbol definitions.
  • Figure S1: Qualitative visualizations of AsyncDepth. We visualize 3D detection in the monocular image on the Ithaca365 dataset. Ground truth boxes are shown in green, baseline model predictions are shown in orange, and AsyncDepth predictions are shown in blue. We also include, for visualization purposes only, the detections in 3D overlaid with the LiDAR point cloud (note: these are not given as the model inputs). Observe that, in the image, the 3D detections are ambiguous and the depth is incorrect in the baseline models. AsyncDepth is able to correct the depth to produce more accurate 3D detections.
  • Figure S2: Visualizing the depth predictions on the Lift-Splat model. We show the current image (top left) and two of the corresponding asynchronous depth maps (top right). On the bottom, we show the predicted depth inside the Lift-Splat model with and without AsyncDepth. We show the ground-truth synchronous depth map on the bottom right as a reference. The colorbar indicates the corresponding depth in meters. Best viewed in color.
  • Figure S3: Precision-recall curve comparing the baseline model and AsyncDepth variant for car detection on Ithaca365 (distance threshold = 1m). With AsyncDepth, the detector makes more precise predictions while maintaining similar recall.