Improving Distant 3D Object Detection Using 2D Box Supervision

Zetong Yang, Zhiding Yu, Chris Choy, Renhao Wang, Anima Anandkumar, Jose M. Alvarez

TL;DR

This work tackles the difficulty of long-range camera-based 3D object detection by removing the reliance on distant 3D annotations. It introduces LR3D, a framework that leverages only 2D box supervision for distant objects and learns an implicit inverse mapping via IP-Head to predict depth conditioned on object size and orientation, with dynamic weights generated per instance. A projection augmentation strategy and a long-range teacher pipeline enable robust training and transfer of depth-consistent predictions to BEV detectors, while a new Long-range Detection Score (LDS) provides a more informative evaluation for distant objects. Across multiple datasets, LR3D achieves substantial improvements in distant-object detection and often approaches the performance of fully 3D-supervised methods, illustrating a scalable path toward annotating distant scenes with minimal 2D labels.

Abstract

Improving the detection of distant 3D objects is an important yet challenging task. For camera-based 3D perception, the annotation of 3D bounding boxes relies heavily on LiDAR for accurate depth information. As such, the annotation range is often limited by the sparsity of LiDAR points on distant objects, which hampers existing detectors in long-range scenarios. We address this challenge by using only 2D box supervision for distant objects, since 2D boxes are easy to annotate. We propose LR3D, a framework that learns to recover the missing depth of distant objects. LR3D adopts an implicit projection head that learns to generate the mapping between 2D boxes and depth using the 3D supervision on close objects. This mapping allows depth estimation for distant objects conditioned on their 2D boxes, making long-range 3D detection with 2D supervision feasible. Experiments show that without distant 3D annotations, LR3D allows camera-based methods to detect distant objects (over 200m) with accuracy comparable to full 3D supervision. Our framework is general and could benefit a wide range of 3D detection methods.
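
As a rough intuition for why a 2D box carries depth information, consider an idealized pinhole camera. This is only an illustrative relation, not the learned mapping used by LR3D, and the symbols are introduced here as assumptions: $f$ is the focal length in pixels, $H$ the physical height of the object in meters, $h$ the height of its 2D box in pixels, and $d$ its depth in meters. Orientation and lens distortion are ignored.

$$
h \approx \frac{f\,H}{d}
\quad\Longrightarrow\quad
d \approx \frac{f\,H}{h}
$$

For example, with $f = 1000$ px, $H = 1.5$ m, and $h = 7.5$ px, this gives $d \approx 1000 \times 1.5 / 7.5 = 200$ m. The relation also shows why depth cannot be read from the 2D box alone: $H$ is unknown and varies per instance, which is consistent with the abstract's choice to learn the mapping from close objects that do have 3D supervision.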

Paper Structure (12 sections, 6 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Upper: Without 3D annotations beyond 40m, LR3D enables the prediction of 3D boxes for extremely distant objects over 200m (right, in green) from an input image and 2D boxes (left). Lower: Existing methods fail to detect 3D objects beyond the 3D supervision range, e.g., 40m in this case. With LR3D, these missing distant objects are detected well.
  • Figure 2: Illustration of LR3D which detects 3D boxes for both close and distant objects using the supervision of close 2D/3D and distant 2D bounding box annotations.
  • Figure 3: Illustration of IP-Head. We use an MLP $f^{(\theta)}$ to fit the implicit function from 2D box to 3D depth, whose weights $\theta$ are dynamically determined by instance features that encode size and orientation.
  • Figure 4: Illustration of the training and testing pipeline of IP-Head. (a) Training: we use 2D/3D annotation pairs of close objects to supervise $f_g$, which generates the dynamic weights of the MLP $f^{(\theta)}$; this MLP models the transformation of the target 3D object from its 2D box to the corresponding depth in Eq. \ref{eq:ipl_f_inverse_sim}. (b) Testing: we use a 2D detection head (2D Det. Head) $f_{2d}$ to generate 2D detection results for all objects, which IP-Head then converts to the corresponding depths.
  • Figure 5: Illustration of deploying IP-Head to monocular 3D detectors: (a) FCOS3D; (b) FastRCNN3D.
  • ...and 3 more figures
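
The dynamic-weight idea described in the captions of Figures 3 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the generator here is a single untrained linear layer standing in for $f_g$, the per-instance MLP $f^{(\theta)}$ has a 2 → 4 → 1 shape, the 2D box is reduced to (width, height), and all function names (`make_generator`, `generate_theta`, `predict_depth`) are hypothetical.

```python
import random

HIDDEN = 4  # hidden width of the per-instance MLP (illustrative choice)


def make_generator(feat_dim, seed=0):
    """Hypothetical stand-in for f_g: one linear layer (random, untrained)
    that emits every parameter of a 2 -> HIDDEN -> 1 MLP."""
    rng = random.Random(seed)
    n_params = 2 * HIDDEN + HIDDEN + HIDDEN + 1  # W1, b1, W2, b2
    return [[rng.gauss(0.0, 0.1) for _ in range(feat_dim)]
            for _ in range(n_params)]


def generate_theta(generator, feat):
    """Map an instance feature vector to the weights theta of the MLP."""
    flat = [sum(g * x for g, x in zip(row, feat)) for row in generator]
    w1 = [flat[2 * j:2 * j + 2] for j in range(HIDDEN)]  # HIDDEN rows of 2
    b1 = flat[2 * HIDDEN:3 * HIDDEN]
    w2 = flat[3 * HIDDEN:4 * HIDDEN]
    b2 = flat[4 * HIDDEN]
    return w1, b1, w2, b2


def predict_depth(theta, box_wh):
    """Apply the instance-specific MLP f^(theta) to a 2D box (width, height)."""
    w1, b1, w2, b2 = theta
    hidden = [max(0.0, row[0] * box_wh[0] + row[1] * box_wh[1] + b)  # ReLU
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


generator = make_generator(feat_dim=8)
box = (12.0, 9.0)      # same 2D box for both instances, in pixels
feat_a = [0.5] * 8     # made-up instance features
feat_b = [-0.5] * 8
theta_a = generate_theta(generator, feat_a)
theta_b = generate_theta(generator, feat_b)
depth_a = predict_depth(theta_a, box)
depth_b = predict_depth(theta_b, box)
```

Because the weights are regenerated per instance, two objects with the same 2D box but different features are passed through different MLPs. The outputs here are meaningless since nothing is trained; per the Figure 4 caption, the generator would be supervised with 2D/3D annotation pairs of close objects, and at test time the 2D boxes would come from the 2D detection head.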