Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss
Eskil Jörgensen, Christopher Zach, Fredrik Kahl
TL;DR
Monocular 3D object detection remains challenging due to depth ambiguity, latency, and the need for precise 3D localization. The authors present SS3D, a single-stage monocular detector that regresses a surrogate 3D representation and then fits 3D boxes with a weighted non-linear least-squares optimizer, enabling end-to-end training through a 3D IoU loss. They explore homoscedastic and heteroscedastic uncertainty models and show that backpropagation through the optimization step further improves accuracy, achieving state-of-the-art results on KITTI while maintaining real-time speed (~20 fps). This framework offers a modular, end-to-end trainable pipeline for monocular 3D perception with per-detection uncertainty estimates, suitable for autonomous driving and mobile robotics, and supports future extensions to temporal data and 3D articulated pose estimation.
Abstract
Three-dimensional object detection from a single view is a challenging task which, if performed with good accuracy, is an important enabler of low-cost mobile robot perception. Previous approaches to this problem suffer either from an overly complex inference engine or from insufficient detection accuracy. To deal with these issues, we present SS3D, a single-stage monocular 3D object detector. The framework consists of (i) a CNN, which outputs a redundant representation of each relevant object in the image with corresponding uncertainty estimates, and (ii) a 3D bounding box optimizer. We show how modeling heteroscedastic uncertainty improves performance over our baseline and, furthermore, how back-propagation can be done through the optimizer in order to train the pipeline end-to-end for additional accuracy. Our method achieves state-of-the-art accuracy on monocular 3D object detection, while running at 20 fps in a straightforward implementation. We argue that the SS3D architecture provides a solid framework upon which high-performing detection systems can be built, with autonomous driving as the main application in mind.
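To make the idea of fitting a box to a redundant, uncertainty-weighted representation concrete, the following is a minimal sketch of weighted non-linear least squares solved by Gauss-Newton. It is a toy problem, not the SS3D parametrization: the two unknowns (depth `z` and object width `w`), the three redundant observations (depth, width, and projected image width `focal * w / z`), and all function names are assumptions chosen for illustration. Each residual is divided by its predicted standard deviation, so observations the network is less sure about pull less on the fit.

```python
import numpy as np


def residuals(params, obs, sigmas, focal):
    """Uncertainty-weighted residuals for a toy (depth, width) fit."""
    z, w = params
    # Redundant predictions implied by the current estimate:
    # depth, metric width, and projected width under a pinhole camera.
    pred = np.array([z, w, focal * w / z])
    return (pred - obs) / sigmas


def jacobian(params, sigmas, focal):
    """Analytic Jacobian of the weighted residuals w.r.t. (z, w)."""
    z, w = params
    J = np.array([
        [1.0, 0.0],
        [0.0, 1.0],
        [-focal * w / z**2, focal / z],
    ])
    # Scale each row by the same 1/sigma used for its residual.
    return J / sigmas[:, None]


def fit(obs, sigmas, focal, init, iters=20):
    """Plain Gauss-Newton iterations (no damping, no line search)."""
    params = np.asarray(init, dtype=float)
    for _ in range(iters):
        r = residuals(params, obs, sigmas, focal)
        J = jacobian(params, sigmas, focal)
        # Solve the linearized problem J @ step = -r in the least-squares sense.
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        params = params + step
    return params


if __name__ == "__main__":
    focal = 700.0                      # hypothetical focal length in pixels
    obs = np.array([20.0, 2.0, 80.0])  # mutually inconsistent observations
    sigmas = np.array([0.01, 1.0, 1.0])  # depth is trusted far more
    print(fit(obs, sigmas, focal, init=[20.0, 2.0]))
```

Because every step is a differentiable linear solve, the same computation written in an automatic-differentiation framework lets gradients of a loss on the fitted parameters flow back into the observations and their uncertainties, which is the property the end-to-end training described above relies on.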
