
Toward Accurate Camera-based 3D Object Detection via Cascade Depth Estimation and Calibration

Chaoqun Wang, Yiran Qin, Zijian Kang, Ningning Ma, Ruimao Zhang

TL;DR

The paper tackles depth-related challenges in camera-based 3D object detection by proposing a cascade framework with two depth-aware learning paradigms: depth estimation (DE) to improve feature lifting with absolute and relative depth supervision, and depth calibration (DC) to embed depth denoising for robust localization during training. Both DE and DC are integrated in an end-to-end training regime without adding inference cost, and demonstrate improvements on NuScenes, including state-of-the-art performance with strong backbones and across multiple detectors. Key contributions include the DE loss design with absolute $L_{adl}$ and relative $L_{rdl}$ terms, the DC mechanism with noised anchors and reconstruction loss $L_{rcl}$, and extensive ablations showing broad applicability and performance gains. The approach offers a practical path to more robust camera-based 3D detection by explicitly handling depth information at both feature lifting and localization stages, with significant implications for autonomous driving systems.

Abstract

Recent camera-based 3D object detection is limited by the precision of the transformation from image to 3D feature space, as well as by the accuracy of object localization within the 3D space. This paper addresses a fundamental problem of camera-based 3D object detection: how to effectively learn depth information for accurate feature lifting and object localization. Unlike previous methods, which directly predict depth distributions with a supervised estimation model, we propose a cascade framework consisting of two depth-aware learning paradigms. First, a depth estimation (DE) scheme leverages relative depth information to lift features effectively from 2D to 3D space. Second, a depth calibration (DC) scheme introduces depth reconstruction to further correct perturbations of the 3D object localization along the depth axis. In practice, DE is realized explicitly by using both an absolute and a relative depth optimization loss to improve the precision of depth prediction, while the capability of DC is embedded implicitly into the detection Transformer through a depth denoising mechanism in the training phase. The entire model is trained in an end-to-end manner. We propose a baseline detector and evaluate the effectiveness of our proposal, obtaining +2.2%/+2.7% NDS/mAP improvements on the NuScenes benchmark and achieving competitive performance of 55.9%/45.7% NDS/mAP. Furthermore, we conduct extensive experiments on various detectors to demonstrate its generality, with improvements of about +2% NDS.
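The combination of an absolute and a relative depth loss mentioned in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the per-pixel depth distribution over discrete bins, the cross-entropy form of the absolute term, the pairwise ranking form of the relative term, and all function names (`absolute_depth_loss`, `relative_depth_loss`, `expected_depth`) are assumptions chosen only to show how the two kinds of supervision differ.

```python
import math


def softmax(logits):
    """Turn per-bin depth logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def absolute_depth_loss(logits, gt_bin):
    """Assumed absolute term: cross-entropy against the ground-truth depth bin."""
    return -math.log(softmax(logits)[gt_bin])


def expected_depth(logits, bin_centers):
    """Expected depth (in metres) under the predicted distribution."""
    return sum(p * d for p, d in zip(softmax(logits), bin_centers))


def relative_depth_loss(logits_a, logits_b, gt_a, gt_b, bin_centers):
    """Assumed relative term: a pairwise ranking loss that is small when the
    predicted depth ordering of two pixels matches the ground-truth ordering."""
    if gt_a == gt_b:
        return 0.0
    sign = 1.0 if gt_a > gt_b else -1.0
    diff = expected_depth(logits_a, bin_centers) - expected_depth(logits_b, bin_centers)
    return math.log1p(math.exp(-sign * diff))


# Hypothetical example: four depth bins and two pixels.
bin_centers = [2.0, 6.0, 10.0, 14.0]
near_logits = [4.0, 0.0, 0.0, 0.0]   # mass concentrated on the nearest bin
far_logits = [0.0, 0.0, 0.0, 4.0]    # mass concentrated on the farthest bin

abs_good = absolute_depth_loss(near_logits, gt_bin=0)
abs_bad = absolute_depth_loss(near_logits, gt_bin=3)
rel_good = relative_depth_loss(near_logits, far_logits, 2.0, 14.0, bin_centers)
rel_bad = relative_depth_loss(far_logits, near_logits, 2.0, 14.0, bin_centers)
```

In this sketch the absolute term only looks at each pixel in isolation, whereas the relative term penalizes a pair of pixels whose predicted ordering along the depth axis contradicts the ground truth, even if each individual prediction is only slightly off.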

Paper Structure (14 sections, 13 equations, 3 figures, 6 tables)


Figures (3)

  • Figure 1: The red curves indicate the predicted depth probability distributions of the objects marked by red dots. By supervising relative depth in the DE scheme, we optimize the distribution (from the red to the green curve) so that it becomes more accurate. In the DC scheme, we generate noised anchors (red cube) from the ground truth box (green cube) in 3D space. By reconstructing them, the detector obtains depth calibration capability.
  • Figure 2: The overall architecture of our proposed detector consists of three parts: feature extractor, FOV to BEV translation, and detection head. For a given multi-camera input, we extract 2D features via a shared encoder and project them into 3D space via the predicted depth; the generated BEV features are then fed into the detection head for object localization and recognition. $L_{adl}$ and $L_{det}$ denote the absolute depth loss and the detection loss, while $L_{rdl}$ and $L_{rcl}$ denote the relative depth loss and the reconstruction loss of our proposed DE and DC schemes, respectively. Best viewed in color; see the method section for more details.
  • Figure 3: Noised anchor generation process. Given a ground truth box in the ego coordinate system, we add depth, scale, and location noise to generate the noised reference anchors, corresponding to $f_d$, $f_s$, and $f_l$ in Eqn. (6), respectively. We visualize the corresponding noised boxes in BEV and image space.
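
The noised anchor generation described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the box layout `(x, y, z, w, l, h)`, the uniform noise, the noise magnitudes, the order in which the three perturbations are applied, the choice to apply depth noise along the ray from the ego origin in the BEV plane, and the L1 form of the reconstruction loss are all hypothetical, as are the names `make_noised_anchor` and `reconstruction_l1`.

```python
import random


def make_noised_anchor(gt_box, depth_ratio=0.1, scale_ratio=0.1, loc_range=0.5, rng=None):
    """Perturb a ground truth box (x, y, z, w, l, h) given in ego coordinates.

    The three steps loosely mirror the depth, scale, and location noise
    (f_d, f_s, f_l) named in the caption; their exact form is assumed.
    """
    rng = rng or random.Random()
    x, y, z, w, l, h = gt_box

    # Depth noise (assumed f_d): slide the center along the ray from the ego origin.
    radial = 1.0 + rng.uniform(-depth_ratio, depth_ratio)
    x, y = x * radial, y * radial

    # Scale noise (assumed f_s): rescale each box dimension independently.
    w, l, h = (s * (1.0 + rng.uniform(-scale_ratio, scale_ratio)) for s in (w, l, h))

    # Location noise (assumed f_l): small additive shift in the BEV plane.
    x += rng.uniform(-loc_range, loc_range)
    y += rng.uniform(-loc_range, loc_range)

    return (x, y, z, w, l, h)


def reconstruction_l1(pred_box, gt_box):
    """Assumed reconstruction loss: mean absolute error between two boxes."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / len(gt_box)


# Hypothetical ground truth box: center (10, 5, -1) m, size 1.9 x 4.6 x 1.7 m.
gt = (10.0, 5.0, -1.0, 1.9, 4.6, 1.7)
anchor = make_noised_anchor(gt, rng=random.Random(0))
clean = make_noised_anchor(gt, depth_ratio=0.0, scale_ratio=0.0, loc_range=0.0)
loss_before_refinement = reconstruction_l1(anchor, gt)
```

During training, a detector following this idea would receive `anchor` as a reference and be supervised to recover `gt`; the gap measured by `reconstruction_l1` is what the denoising objective would drive toward zero.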