AuxDepthNet: Real-Time Monocular 3D Object Detection with Depth-Sensitive Features

Ruochen Zhang; Hyeung-Sik Choi; Dongwook Jung; Phan Huy Nam Anh; Sang-Ki Jeong; Zihao Zhu

TL;DR

AuxDepthNet tackles monocular 3D object detection without external depth maps. It introduces the Auxiliary Depth Feature (ADF) and Depth Position Mapping (DPM) modules within a DepthFusion Transformer to implicitly learn depth-sensitive features and depth-position cues. The framework achieves competitive KITTI results in 3D and BEV detection, reported as AP$_{3D}$ and AP$_{BEV}$ scores across the Easy/Moderate/Hard settings, while maintaining real-time performance. Ablation studies examine the contributions of depth prototypes, depth-guided queries, and backbone choices, supporting robust depth-aware reasoning with efficient computation.

Abstract

Monocular 3D object detection is a challenging task in autonomous systems due to the lack of explicit depth information in single-view images. Existing methods often depend on external depth estimators or expensive sensors, which increase computational complexity and hinder real-time performance. To overcome these limitations, we propose AuxDepthNet, an efficient framework for real-time monocular 3D object detection that eliminates the reliance on external depth maps or pre-trained depth models. AuxDepthNet introduces two key components: the Auxiliary Depth Feature (ADF) module, which implicitly learns depth-sensitive features to improve spatial reasoning and computational efficiency, and the Depth Position Mapping (DPM) module, which embeds depth positional information directly into the detection process to enable accurate object localization and 3D bounding box regression. Leveraging the DepthFusion Transformer architecture, AuxDepthNet globally integrates visual and depth-sensitive features through depth-guided interactions, ensuring robust and efficient detection. Extensive experiments on the KITTI dataset show that AuxDepthNet achieves state-of-the-art performance, with $\text{AP}_{3D}$ scores of 24.72\% (Easy), 18.63\% (Moderate), and 15.31\% (Hard), and $\text{AP}_{\text{BEV}}$ scores of 34.11\% (Easy), 25.18\% (Moderate), and 21.90\% (Hard) at an IoU threshold of 0.7.
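To make the "IoU threshold of 0.7" in the reported scores concrete, here is a minimal sketch of how a single predicted box could be tested against a ground-truth box in bird's-eye view. It is not the KITTI evaluation code: it assumes axis-aligned rectangles given as `(x_min, z_min, x_max, z_max)`, whereas the benchmark uses rotated boxes, and the function names `bev_iou_axis_aligned` and `is_true_positive` are hypothetical.

```python
def bev_iou_axis_aligned(a, b):
    """IoU of two axis-aligned BEV rectangles (x_min, z_min, x_max, z_max).

    Simplifying assumption: boxes are not rotated. KITTI boxes carry a yaw
    angle, so this is only an illustration of the overlap criterion.
    """
    # Overlap extent along each axis, clamped at zero when boxes are disjoint.
    overlap_x = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_z = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_x * overlap_z

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.7):
    """A prediction counts as a match only if its IoU reaches the threshold."""
    return bev_iou_axis_aligned(pred, gt) >= threshold


if __name__ == "__main__":
    ground_truth = (0.0, 0.0, 2.0, 4.0)
    close_prediction = (0.2, 0.0, 2.2, 4.0)   # shifted by 0.2 m sideways
    loose_prediction = (1.0, 0.0, 3.0, 4.0)   # shifted by 1.0 m sideways
    print(is_true_positive(close_prediction, ground_truth))
    print(is_true_positive(loose_prediction, ground_truth))
```

Under this simplification, a car-sized box shifted sideways by half its width already falls well below the 0.7 threshold, which is why localization errors in depth weigh so heavily on AP in the monocular setting.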

Paper Structure (27 sections, 10 equations, 6 figures, 3 tables)

Introduction
Related Work
Monocular 3D Object Detection Methods
Transformer in Monocular 3D Object Detection
Proposed Method
Overview
Depth-Sensitive Feature Enhancement
Extracting Foundational depth-sensitive Features.
Depth-Sensitive Prototype Representation module.
Feature enhancement with depth prototype.
Extracting Foundational Depth-Sensitive Features
Transformer Encoder.
Transformer Decoder.
Depth Position Mapping (DPM) module.
Loss Function
...and 12 more sections

Figures (6)

Figure 1: Representative Depth-Assisted Monocular 3D Object Detection Frameworks. (a) Depth estimation methods use monocular inputs to construct pseudo-LiDAR data, enabling LiDAR-style 3D detectors [ma2019accurate, wang2019pseudo, weng2019monocular]. (b) Fusion-based methods combine visual and depth features to improve object detection accuracy [ding2020learning, ouyang2020dynamic, wang2021depth]. (c) Our AuxDepthNet leverages depth guidance during training to develop depth-sensitive features and performs end-to-end 3D object detection without requiring external depth estimators.
Figure 2: The overall framework of our proposed method. AuxDepthNet enhances monocular 3D object detection by integrating depth-sensitive, context-sensitive, and depth-guided features. The Auxiliary Depth Feature (ADF) module encodes depth-related cues without pre-computed depth maps. Context-sensitive features are refined by a feature pyramid and DepthFusion Transformer (DFT), providing semantic and spatial context. The Depth Position Mapping (DPM) module embeds depth-based positional information for precise 3D localization. This integration captures local and global spatial relationships efficiently, delivering robust 2D and 3D detection.
Figure 3: Architecture of the Auxiliary Depth Feature (ADF) module. (a) The initial depth-sensitive feature $F_{\text{init}}$ is generated and the depth distribution $P_{\text{depth}}$ is determined. (b) $P[d]$ denotes the feature representation of the depth prototype. (c) The depth-prototype-enhanced feature $F_{\text{enhanced}}$ is generated and fused with $F_{\text{init}}$.
Figure 4: Overview of the proposed Depth Position Mapping (DPM) module. This process aligns spatial features with depth information, enhancing the model's 3D geometric understanding.
Figure 5: Comparison of AP$_{\text{3D}}$ detection accuracy for the Car category on the KITTI validation set, with standard convolution replaced by dilated convolutions using dilation rates of 2, 4, 8, and 16 in the ADF module.
...and 1 more figures
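
The three steps in the Figure 3 caption can be sketched in NumPy as follows. This is one plausible reading of the caption, not the authors' implementation: it assumes $P_{\text{depth}}$ is a per-pixel softmax over a fixed number of depth bins, that each prototype $P[d]$ is a probability-weighted mean of $F_{\text{init}}$, and that fusion is channel concatenation. The function name `depth_prototype_enhance` and the argument `depth_logits` are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def depth_prototype_enhance(f_init, depth_logits):
    """Sketch of prototype-based feature enhancement.

    f_init:       (C, H, W) initial depth-sensitive feature.
    depth_logits: (D, H, W) unnormalized scores over D depth bins (assumed).
    Returns (p_depth, prototypes, fused).
    """
    channels, height, width = f_init.shape
    bins = depth_logits.shape[0]

    # (a) Per-pixel depth distribution over the D bins.
    p_depth = softmax(depth_logits, axis=0)                # (D, H, W)

    feats = f_init.reshape(channels, height * width)       # (C, N)
    probs = p_depth.reshape(bins, height * width)          # (D, N)

    # (b) One prototype per bin: probability-weighted mean of pixel features.
    weights = probs.sum(axis=1, keepdims=True) + 1e-8      # (D, 1)
    prototypes = (probs @ feats.T) / weights               # (D, C)

    # (c) Project prototypes back to pixels, then fuse with F_init.
    f_enhanced = (prototypes.T @ probs).reshape(channels, height, width)
    fused = np.concatenate([f_init, f_enhanced], axis=0)   # (2C, H, W), assumed fusion
    return p_depth, prototypes, fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(8, 4, 5))
    logits = rng.normal(size=(3, 4, 5))
    p_depth, prototypes, fused = depth_prototype_enhance(features, logits)
    print(p_depth.shape, prototypes.shape, fused.shape)
```

Concatenation is chosen here only because it keeps $F_{\text{init}}$ intact in the output; an additive or learned fusion would fit the caption equally well.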
