MonoCD: Monocular 3D Object Detection with Complementary Depths

Longfei Yan, Pei Yan, Shengzhou Xiong, Xuanyu Xiang, Yihua Tan

TL;DR

This work addresses depth coupling in monocular 3D object detection, where the errors of multiple depth predictions tend to share the same sign, by introducing complementary depths. It adds a depth branch driven by a global clue and exploits geometric relations among depth clues to encourage opposite-sign errors across branches, which makes depth fusion more effective. The approach fuses multiple depth predictions using learned uncertainties, forming the final depth as $z_{soft}=\sum_i w_i z_i$ with normalized weights $w_i=\frac{1/\sigma_i}{\sum_j 1/\sigma_j}$. On KITTI, MonoCD achieves state-of-the-art results without extra data, and the complementary depth is lightweight enough to serve as a plug-in for other detectors. The paper provides both theoretical and empirical evidence that increasing depth complementarity improves 3D localization accuracy for monocular systems.
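The uncertainty-weighted fusion described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `fuse_depths` and the sample values are hypothetical, and the sketch assumes the inverse-uncertainty weights are normalized to sum to one so that the result is a weighted average of the branch depths.

```python
def fuse_depths(depths, sigmas):
    """Combine per-branch depths with inverse-uncertainty weights.

    Assumption: the weights 1 / sigma_i are normalized to sum to one,
    so the output is a weighted average of the inputs.
    """
    if not depths or len(depths) != len(sigmas):
        raise ValueError("depths and sigmas must be non-empty and of equal length")
    if any(s <= 0 for s in sigmas):
        raise ValueError("uncertainties must be positive")
    inv = [1.0 / s for s in sigmas]          # unnormalized weights
    total = sum(inv)                         # normalization constant
    return sum(w * z for w, z in zip(inv, depths)) / total


# Hypothetical depths (m) and uncertainties for three branches.
z_soft = fuse_depths([30.5, 29.0, 31.5], [0.5, 1.0, 1.0])
print(z_soft)
```

A branch with a smaller predicted uncertainty pulls the fused depth toward its own estimate, which is why the sign pattern of the branch errors matters for the final result.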

Abstract

Monocular 3D object detection has attracted widespread attention due to its potential to accurately obtain object 3D localization from a single image at a low cost. Depth estimation is an essential but challenging subtask of monocular 3D object detection due to the ill-posedness of 2D-to-3D mapping. Many methods explore multiple local depth clues such as object heights and keypoints and then formulate the object depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. However, the errors of existing multiple depths tend to have the same sign, which hinders them from neutralizing each other and limits the overall accuracy of combined depth. To alleviate this problem, we propose to increase the complementarity of depths with two novel designs. First, we add a new depth prediction branch named complementary depth that utilizes global and efficient depth clues from the entire image rather than the local clues to reduce the correlation of depth predictions. Second, we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form. Benefiting from these designs, our method achieves higher complementarity. Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data. In addition, complementary depth can also be a lightweight and plug-and-play module to boost multiple existing monocular 3D object detectors. Code is available at https://github.com/elvintanhust/MonoCD.
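The claim that same-sign errors cannot neutralize each other can be checked with a toy calculation. The sketch below is illustrative only: `combined_error` is a hypothetical helper, and the error values and equal weights are made up rather than taken from the paper.

```python
def combined_error(errors, weights):
    """Signed error of a weighted average of branch depths.

    If every branch predicts z* + e_i, the weighted average predicts
    z* plus the same weighted average of the errors e_i.
    """
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, errors)) / total


weights = [1.0, 1.0]
coupled = combined_error([+1.0, +0.6], weights)        # both branches overestimate
complementary = combined_error([+1.0, -0.6], weights)  # opposite-sign errors
print(coupled, complementary)
```

With same-sign errors, the combined error always lies between the smallest and the largest branch error, so fusion cannot beat the best single branch. With opposite-sign errors, part of the error cancels and the combined error can fall below every individual branch error.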

Paper Structure (23 sections, 16 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: (a) Comparison of coupling (coup) and complementary (comp) multi-depth with two depth branches $Z_1$ and $Z_2$, where $Z^*$ and $Z_{soft}$ represent the ground-truth depth and the final combined depth, respectively. (b) A demonstration of how geometric relations make the two depth branches complementary when only the estimate of the object 3D height $H$ is inaccurate. Both $Z_1$, generated by the widely used local height clue, and $Z_2$, generated by our newly introduced global clue $y_{glo}$, are related to $H$. $H^*$ and $\hat{H}$ denote the ground truth of $H$ and the underestimated $H$, respectively.
  • Figure 2: Overview of the approach. The input image is first processed by a feature extraction network and then fed into multiple prediction heads, which are divided into two parts. The upper orange section predicts the global horizon heatmap of the image, which serves as a global clue to generate the complementary depth ($z_{comp}$). The lower blue section predicts local information for each point of interest and then generates the keypoint depths ($z_{key}$) and the direct depth ($z_{dir}$). Finally, the three depth prediction branches are weighted by their simultaneously predicted uncertainties and combined to obtain the final depth estimate.
  • Figure 3: Evaluation of complementary effect on the KITTI validation set. The metric is $AP_{40}$ for the moderate Car category at 0.7 IoU threshold. Left: Different proportions of flipped samples achieve different levels of complementarity. Right: Fixing the proportion of flipped samples to 50% and applying random disturbances of different magnitudes to the flipped depth branch.
  • Figure 4: Geometric correspondence of different depths. To avoid overlap, the geometric correspondences of $z_{key}$ and $z_{comp}$ are marked with orange and blue lines, respectively.
  • Figure 5: Qualitative examples on the KITTI validation set. Each row shows one front view (left) and four bird's-eye views (right). The bird's-eye views differ only in the depth output used, progressing from $z_{soft}$ to $z_{dir}$, $z_{key}$, and $z_{comp}$ from left to right. Red boxes are the ground truth and green boxes are the predictions. Some objects are circled to highlight the differences across the depth prediction branches.
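
The complementarity illustrated in Figure 1(b) can be reproduced numerically under a pinhole camera model. The sketch below is an assumption-laden illustration, not the paper's code. The height-based depth $Z_1 = f H / h$ is the standard pinhole relation. The second formula is an assumed form that treats $y_{glo}$ as the camera-frame vertical coordinate of the object's bottom (y axis pointing down) and places the object center at $y_{glo} - H/2$; the captions here do not state this formula. All function names and scene values are hypothetical.

```python
def depth_from_height(f, H, h_px):
    """Pinhole relation: an object of 3D height H (m) spanning h_px pixels
    lies at depth f * H / h_px."""
    return f * H / h_px


def depth_from_ground(f, y_glo, H, v_c, c_v):
    """Assumed complementary form: the object's bottom is at camera-frame
    vertical coordinate y_glo (y pointing down), its center at y_glo - H / 2,
    and the center projects to image row v_c (principal-point row c_v)."""
    return f * (y_glo - H / 2.0) / (v_c - c_v)


# Hypothetical scene: focal length (px), principal-point row, ground truth.
f, c_v = 720.0, 180.0
z_true, H_true, y_glo = 30.0, 1.5, 1.65

# Image measurements implied by the ground truth (assumed noise-free).
h_px = f * H_true / z_true
v_c = c_v + f * (y_glo - H_true / 2.0) / z_true

H_hat = 1.4  # underestimated 3D height
z1 = depth_from_height(f, H_hat, h_px)             # local height clue
z2 = depth_from_ground(f, y_glo, H_hat, v_c, c_v)  # assumed global clue form
print(z1 - z_true, z2 - z_true)  # signed errors of the two branches
```

Under these assumptions, underestimating $H$ makes the height-based depth too small and the ground-based depth too large, so the two errors have opposite signs and partly cancel when the branches are fused.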