Depth as Points: Center Point-based Depth Estimation

Zhiheng Tu, Xinjian Huang, Yong He, Ruiyang Zhou, Bo Du, Weitao Wu

TL;DR

This work tackles the challenge of real-time monocular depth perception in urban autonomous driving by introducing CenterDepth, a center-point regression framework that couples object detection with localized depth prediction. Central to the approach are Center Point Regression for detecting object centers and Center FC-CRFs for efficient, global-information–driven depth propagation anchored at those centers, enabling accurate depth over distances up to $200$ meters without full-scene depth maps. To support training and evaluation, the authors build virDepth, a large virtual dataset generated via CARLA and UE4, providing synchronized RGB, depth, and semantic labels across diverse urban scenes. Empirical results show CenterDepth achieves high depth accuracy (e.g., $\delta_1$ approaching $0.989$ on virDepth) and favorable efficiency across backbones, outperforming state-of-the-art methods on virDepth, Virtual KITTI 2, KITTI-Depth, and KITTI-3D, while maintaining strong generalization and BEV-path-planning utility. These findings suggest a practical, scalable path to robust monocular depth perception for real-time autonomous driving applications.
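
The $\delta_1$ figure quoted above is a threshold accuracy. As a minimal sketch, assuming the conventional definition (the fraction of valid points whose ratio $\max(\hat{d}/d, d/\hat{d})$ falls below $1.25$), it could be computed as shown below. The function name `delta1_accuracy`, the `max_depth` cutoff, and the sample values are illustrative assumptions and are not taken from the paper.

```python
def delta1_accuracy(pred, gt, threshold=1.25, max_depth=None):
    """Fraction of valid points with max(pred/gt, gt/pred) < threshold.

    pred, gt: equal-length sequences of depths in meters.
    max_depth: optional cutoff (hypothetical parameter) on ground-truth depth.
    """
    hits = 0
    total = 0
    for p, g in zip(pred, gt):
        # Skip points without valid ground truth.
        if g <= 0:
            continue
        # Optionally ignore targets beyond the evaluation range.
        if max_depth is not None and g > max_depth:
            continue
        total += 1
        # A non-positive prediction is counted as a miss.
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    return hits / total if total else float("nan")


# Made-up depths for four object centers (not data from the paper).
pred = [10.0, 52.0, 130.0, 180.0]
gt = [10.5, 50.0, 100.0, 200.0]
print(delta1_accuracy(pred, gt))  # 0.75
```

Only the third point misses here (ratio $1.3$), so three of the four points count as accurate.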

Abstract

The perception of vehicles and pedestrians in urban scenarios is crucial for autonomous driving. This process typically involves complicated data collection and imposes high computational and hardware demands. To address these limitations, we first develop a highly efficient method for generating virtual datasets, which enables the creation of task- and scenario-specific datasets in a short time. Leveraging this method, we construct the virtual depth estimation dataset VirDepth, a large-scale, multi-task autonomous driving dataset. Subsequently, we propose CenterDepth, a lightweight architecture for monocular depth estimation that ensures high operational efficiency and exhibits superior performance in depth estimation tasks with highly imbalanced height-scale distributions. CenterDepth integrates global semantic information through the innovative Center FC-CRFs algorithm, aggregates multi-scale features based on object key points, and enables detection-based depth estimation of targets. Experiments demonstrate that our proposed method achieves superior performance in terms of both computational speed and prediction accuracy.

Paper Structure

This paper contains 13 sections, 8 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of the Proposed System: The architecture consists of an obstacle detection network and a depth prediction module. The CenterDepth module performs localized depth prediction based on the object center predicted by the object detection task. The two tasks are jointly learned, with a shared feature extraction layer. Consequently, as the model receives target center information from the detector, it achieves improved accuracy in depth prediction.
  • Figure 2: Depth prediction results from traditional depth estimation methods. From left to right, the columns show the original images and the results from DepthAnything, Monodepth, and DepthAnythingV2. Red indicates that the target's depth is indistinguishable from the background depth; green indicates that the depth of the target area is predicted successfully. Because the target inside the yellow circle in the original image is small, it blends into the background in the outputs of the traditional methods, and the prediction fails.
  • Figure 3: The left panel illustrates the keypoint prediction of objects using heatmaps, where the intensity of each pixel represents the confidence of the corresponding keypoint location. The right panel depicts the principle of CenterCRFs, which adaptively allocates weights to pixels based on their distance to the keypoint. This mechanism enables keypoint-based feature aggregation by assigning higher weights to pixels in the vicinity of the keypoint, thereby enhancing the semantic consistency and spatial coherence of features around the target's central location.
  • Figure 4: A schematic of the virDepth dataset. (a) shows the RGB images, (b) shows the semantic segmentation images, and (c) shows the depth images.
  • Figure 5: The training images were selected based on the given semantic segmentation maps, identifying the regions corresponding to vehicles and pedestrians. The depth information for these regions was extracted from the depth ground truth maps. We evaluate the vehicle and pedestrian targets within a 100-meter range.
  • ...and 3 more figures
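
The Figure 3 caption describes two steps: reading an object keypoint off a confidence heatmap, then weighting nearby pixels more heavily when aggregating features. The sketch below illustrates that general idea only. The Gaussian falloff, the `sigma` parameter, and the function names `heatmap_peak`, `center_weights`, and `aggregate` are assumptions made for illustration; the caption does not give the actual weighting function used by Center FC-CRFs.

```python
import math


def heatmap_peak(heatmap):
    """Return (row, col) of the highest-confidence pixel in a 2D heatmap."""
    best = max(
        ((value, (y, x))
         for y, row in enumerate(heatmap)
         for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1]


def center_weights(height, width, center, sigma=2.0):
    """Normalized weights that decay with distance from the keypoint.

    The Gaussian form is an assumed stand-in for the paper's weighting.
    """
    cy, cx = center
    weights = [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def aggregate(feature_map, weights):
    """Weighted sum of a single-channel feature map."""
    return sum(
        f * w
        for f_row, w_row in zip(feature_map, weights)
        for f, w in zip(f_row, w_row)
    )


# Toy 5x5 heatmap with one confident keypoint (made-up values).
heatmap = [[0.0] * 5 for _ in range(5)]
heatmap[1][3] = 0.9
center = heatmap_peak(heatmap)
weights = center_weights(5, 5, center)
features = [[float(y + x) for x in range(5)] for y in range(5)]
pooled = aggregate(features, weights)
```

Because the weights are normalized to sum to one, `pooled` is a weighted average of the feature map that is dominated by values close to the detected center.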