MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors

Fanqi Pu, Yifan Wang, Jiru Deng, Wenming Yang

TL;DR

MonoDGP addresses monocular 3D object detection by introducing geometry-error priors that correct the projection-based depth, reducing depth uncertainty without multi-depth branches. It decouples a 2D visual decoder from a 3D depth-guided decoder and adds a Region Segment Head (RSH) that sharpens foreground features and provides segment embeddings. The approach achieves state-of-the-art results on KITTI without extra data, with ablations confirming the benefits of decoupled queries, RSH, and geometry-error depth. The authors also note that the method can extend to dense depth map prediction within target regions.

Abstract

Perspective projection has been extensively utilized in monocular 3D object detection methods. It introduces geometric priors from 2D bounding boxes and 3D object dimensions to reduce the uncertainty of depth estimation. However, due to depth errors originating from the object's visual surface, the height of the bounding box often fails to represent the actual projected central height, which undermines the effectiveness of geometric depth. Direct prediction for the projected height unavoidably results in a loss of 2D priors, while multi-depth prediction with complex branches does not fully leverage geometric depth. This paper presents a Transformer-based monocular 3D object detection method called MonoDGP, which adopts perspective-invariant geometry errors to modify the projection formula. We also try to systematically discuss and explain the mechanisms and efficacy behind geometry errors, which serve as a simple but effective alternative to multi-depth prediction. Additionally, MonoDGP decouples the depth-guided decoder and constructs a 2D decoder only dependent on visual features, providing 2D priors and initializing object queries without the disturbance of 3D detection. To further optimize and fine-tune input tokens of the transformer decoder, we also introduce a Region Segment Head (RSH) that generates enhanced features and segment embeddings. Our monocular method demonstrates state-of-the-art performance on the KITTI benchmark without extra data. Code is available at https://github.com/PuFanqi23/MonoDGP.
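To make the depth decomposition concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard pinhole relation $Z_{geo} = f \cdot H_{3D} / h_{2D}$ for the geometric depth and an additive correction $Z = Z_{geo} + Z_{err}$, which is how the paper describes the predicted depth being "separated" into the two terms. The function names and all numeric values are hypothetical.

```python
def geometric_depth(focal_px: float, height_3d_m: float, height_2d_px: float) -> float:
    """Pinhole-camera depth estimate: Z_geo = f * H_3D / h_2D.

    focal_px     -- focal length in pixels
    height_3d_m  -- physical object height in metres
    height_2d_px -- projected height in pixels (here: the 2D box height)
    """
    if height_2d_px <= 0:
        raise ValueError("projected height must be positive")
    return focal_px * height_3d_m / height_2d_px


def corrected_depth(focal_px: float, height_3d_m: float,
                    height_2d_px: float, depth_error_m: float) -> float:
    """Assumed additive correction: Z = Z_geo + Z_err."""
    return geometric_depth(focal_px, height_3d_m, height_2d_px) + depth_error_m


# Hypothetical numbers, chosen only for illustration.
f, H = 700.0, 1.5            # focal length (px), object height (m)
h_bbox, h_c = 50.0, 48.0     # box height vs. projected central height (px)

z_geo = geometric_depth(f, H, h_bbox)   # depth implied by the box height
z_true = geometric_depth(f, H, h_c)     # depth implied by the central height
z_err = z_true - z_geo                  # the gap a network would have to predict

print(z_geo, z_true, z_err)             # 21.0 21.875 0.875
print(corrected_depth(f, H, h_bbox, z_err))  # 21.875
```

Because the box height is larger than the projected central height in this example, the box-based estimate is too shallow, and the error term restores the missing depth.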

Paper Structure

This paper contains 22 sections, 27 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Comparison with MonoDETR. MonoDGP employs an RSH module for enhanced features and segment embeddings, along with an independent visual decoder for 2D query initialization. We further propose a geometry error prior that converts an uneven depth distribution into a more concentrated error distribution.
  • Figure 2: Overall architecture of MonoDGP. The network comprises three components: feature extraction and enhancement, a transformer encoder-decoder, and object detection heads. The parallel visual and depth branches within the transformer are shown in yellow and blue, respectively. The predicted depth is separated into the geometric depth $Z_{\text{geo}}$, obtained from the perspective projection formula, and the depth error $Z_{\text{err}}$, which corrects the formula.
  • Figure 3: Structure of the region segmentation module. The multi-scale feature maps are progressively upsampled and added, and the segment heads output target region probabilities. Enhanced features are then obtained through element-wise multiplication with the original features. Finally, segment embeddings are acquired under threshold constraints.
  • Figure 4: (a) Schematic illustration of how geometry error occurs. Because the height of the bounding box $h_{\text{bbox}}$ is larger than the projected central height $h_c$, there is an error between the geometric depth and the ground-truth depth. (b) Vehicles at different angles in a bird's-eye view. The geometric depth $Z_{\text{geo}}$ is the distance between the camera plane and the parallel plane on which the closest wheel sits, while the depth error $Z_{\text{err}}$ depends only on the object's dimensions and orientation. The camera height also has a small effect on $Z_{\text{err}}$; a rigorous proof is given in the appendix.
  • Figure 5: Comparison of 3D data distributions. With the total number of objects remaining constant, each distribution is visualized using a histogram.
  • ...and 5 more figures
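
The last two steps in the Figure 3 caption (element-wise multiplication and thresholding) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: it skips the multi-scale upsample-and-add stage, assumes a sigmoid turns the segment head's output into probabilities, and assumes that "segment embeddings under threshold constraints" means looking up one of two learned vectors (background or foreground) per pixel. The function names, the 0.5 threshold, and the two-row embedding table are all hypothetical.

```python
import numpy as np


def enhance_features(features, region_logits, threshold=0.5):
    """Weight features by region probability and derive per-pixel segment ids.

    features      -- array of shape (C, H, W)
    region_logits -- array of shape (H, W), raw segment-head output (assumed)
    threshold     -- hypothetical cut-off separating foreground from background
    """
    prob = 1.0 / (1.0 + np.exp(-region_logits))        # sigmoid -> (H, W)
    enhanced = features * prob[None, :, :]             # broadcast over channels
    segment_ids = (prob > threshold).astype(np.int64)  # 1 = foreground, 0 = background
    return enhanced, prob, segment_ids


def segment_embeddings(segment_ids, table):
    """Look up one embedding per pixel; table has shape (2, D). Returns (H, W, D)."""
    return table[segment_ids]


# Toy inputs, chosen only for illustration.
features = np.ones((2, 2, 2))
logits = np.array([[10.0, -10.0], [0.0, 3.0]])
table = np.array([[0.0, 0.0], [1.0, 1.0]])   # row 0: background, row 1: foreground

enhanced, prob, ids = enhance_features(features, logits)
emb = segment_embeddings(ids, table)
print(ids)        # foreground mask after thresholding
print(emb.shape)  # (2, 2, 2)
```

Multiplying by the probability map suppresses background responses while leaving confident foreground regions nearly unchanged, which matches the "enhanced features" role described in the caption.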