OBMO: One Bounding Box Multiple Objects for Monocular 3D Object Detection

Chenxi Huang, Tong He, Haidong Ren, Wenxiao Wang, Binbin Lin, Deng Cai

TL;DR

This paper proposes OBMO, a simple yet effective plug-and-play module that improves state-of-the-art monocular 3D detectors by a significant margin. It adds pseudo 3D labels during training and uses two label scoring strategies to represent their quality.

Abstract

Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin: under the moderate setting on the KITTI validation set, the improvements are 1.82% to 10.91% mAP in BEV and 1.18% to 9.36% mAP in 3D. Code has been released at https://github.com/mrsempress/OBMO.
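
The frustum shift described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: it assumes an ideal pinhole camera, and the function names (`project`, `shift_along_ray`), the intrinsics, and the depth offsets are all hypothetical. It moves only the 3D box center along its viewing ray, so the projected center stays fixed; the box dimensions are left unchanged, which means the projected 2D box is only approximately preserved. The label scoring strategies are not modeled here.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in camera coordinates, z = depth


def project(center: Point3D, f: float, cu: float, cv: float) -> Tuple[float, float]:
    """Project a 3D point to pixel coordinates with an ideal pinhole model."""
    x, y, z = center
    return (f * x / z + cu, f * y / z + cv)


def shift_along_ray(center: Point3D, depth_offsets: List[float]) -> List[Point3D]:
    """Return pseudo centers at depth z + dz that lie on the same viewing ray.

    Scaling x and y by (z + dz) / z keeps x/z and y/z constant, so every
    pseudo center projects to the same pixel as the original center.
    """
    x, y, z = center
    shifted = []
    for dz in depth_offsets:
        new_z = z + dz
        if new_z <= 0:  # skip points at or behind the camera plane
            continue
        scale = new_z / z
        shifted.append((x * scale, y * scale, new_z))
    return shifted


# Hypothetical example values (not from the paper).
center = (2.0, 1.5, 20.0)
pseudo_centers = shift_along_ray(center, [-1.0, -0.5, 0.5, 1.0])

u0, v0 = project(center, f=700.0, cu=600.0, cv=180.0)
for c in pseudo_centers:
    u, v = project(c, f=700.0, cu=600.0, cv=180.0)
    # Same pixel, different depth: the ambiguity the abstract describes.
    assert abs(u - u0) < 1e-9 and abs(v - v0) < 1e-9
```

Because every pseudo center shares the original projection, an image-only network has no geometric cue to prefer one depth over another, which is why the abstract pairs these labels with quality scores rather than treating them all as equally correct.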

Paper Structure (24 sections, 6 equations, 7 figures, 12 tables)


Figures (7)

  • Figure 1: Objects with different depths and dimensions in 3D space. Objects $P$ and $Q$ have the same bounding box and similar visual features in the 2D image, leading to depth ambiguity.
  • Figure 2: Different views of object $P$ and $Q$ with the same 2D bounding box.
  • Figure 3: The loss based on PatchNet. Both the dimension and orientation predictions are stable.
  • Figure 4: The architecture of OBMO with label scoring strategies embedded in GUPNet. The differences are marked in orange. ⓒ means "compare". The OBMO module produces a set of pseudo labels and adds an extra attribute to measure their quality. The Label Score branch is inserted into GUPNet, parallel to the 3D prediction branches. The OBMO module is only active during training.
  • Figure 5: The depth loss with and without the proposed module (OBMO), based on PatchNet. The more stable loss curve shows that OBMO stabilizes depth training.
  • ...and 2 more figures