MonoLSS: Learnable Sample Selection For Monocular 3D Detection

Zhenjia Li, Jinrang Jia, Yifeng Shi

TL;DR

This work tackles the challenge of uneven feature usefulness in monocular 3D detection by introducing Learnable Sample Selection (LSS), which uses a Gumbel-Softmax based sampler and a relative-distance threshold to adaptively select informative object-level samples for 3D property regression. It pairs LSS with MixUp3D, a physically plausible spatial overlap augmentation, enabling richer and less ambiguous 3D samples without additional labels. The approach yields state-of-the-art KITTI results across Car, Cyclist, and Pedestrian without extra data and demonstrates robust cross-dataset performance on Waymo and nuScenes, while remaining end-to-end trainable and efficient. Overall, LSS and MixUp3D provide orthogonal, complementary improvements to feature selection and data diversity in monocular 3D detection, with strong practical implications for autonomous driving systems.

Abstract

In autonomous driving, monocular 3D detection is a critical task that estimates the 3D properties (depth, dimension, and orientation) of objects from a single RGB image. Previous works have used features heuristically to learn 3D properties, without considering that inappropriate features could have adverse effects. This paper introduces sample selection, so that only suitable samples are used to train the regression of 3D properties. To select samples adaptively, we propose a Learnable Sample Selection (LSS) module, which is based on Gumbel-Softmax and a relative-distance sample divider. The LSS module works under a warm-up strategy, which improves training stability. Additionally, since the LSS module dedicated to 3D property sample selection relies on object-level features, we further develop a data augmentation method named MixUp3D, which enriches 3D property samples while conforming to imaging principles and without introducing ambiguity. As two orthogonal methods, the LSS module and MixUp3D can be used independently or in conjunction. Extensive experiments show that their combined use yields synergistic effects, with improvements exceeding the sum of their individual contributions. Leveraging the LSS module and MixUp3D, and without any extra data, our method, named MonoLSS, ranks 1st in all three categories (Car, Cyclist, and Pedestrian) on the KITTI 3D object detection benchmark and achieves competitive results on both the Waymo dataset and the KITTI-nuScenes cross-dataset evaluation. The code is included in the supplementary material and will be released to facilitate related academic and industrial studies.
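
The sample-selection idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract names Gumbel-Softmax and a relative-distance divider but does not give their exact form here, so the 7x7 grid of object-level samples, the `ratio` threshold relative to the largest probability, and the function names `gumbel_softmax`, `select_samples`, and `masked_mean_loss` are all assumptions made for illustration.

```python
import numpy as np


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed categorical sample over all positions of `logits`."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    scaled = (logits + gumbel) / tau      # tau is the softmax temperature
    scaled = scaled - scaled.max()        # subtract max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()


def select_samples(logits, ratio=0.5, tau=1.0, rng=None):
    """Hypothetical divider: keep positions whose probability is at least
    `ratio` times the largest probability (an assumed rule, not the paper's)."""
    probs = gumbel_softmax(logits, tau=tau, rng=rng)
    mask = probs >= ratio * probs.max()
    return probs, mask


def masked_mean_loss(per_sample_loss, mask):
    """Average the 3D-property loss over the selected samples only."""
    return float((per_sample_loss * mask).sum() / mask.sum())


# Usage with made-up numbers: one object, a 7x7 grid of candidate samples.
rng = np.random.default_rng(0)
logits = rng.normal(size=(7, 7))             # stand-in for predicted log-probabilities
per_sample_loss = rng.uniform(size=(7, 7))   # stand-in for per-position regression loss
probs, mask = select_samples(logits, rng=rng)
loss = masked_mean_loss(per_sample_loss, mask)
```

The mask always contains at least the highest-probability position, so the masked average is well defined. A trainable version would need a differentiable framework so that gradients reach the logits; this sketch only shows the forward selection step.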

Paper Structure (12 sections, 4 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Comparison of the features used by different methods for 3D property learning. When the occluded white vehicle is the detection target, different methods use distinct features for learning. Yellow marks features that are used; blue marks features that are not.
  • Figure 2: An overview of the MonoLSS framework. First, a 2D detector combined with ROI-Align is used to generate object features. Then, six heads respectively predict 3D properties (depth, dimension, orientation, and 3D center projection offset), depth uncertainty, and logarithmic probability. Finally, the Learnable Sample Selection (LSS) module adaptively selects samples and acts on the loss calculation.
  • Figure 3: Visualization of MixUp3D, which simulates spatial overlap. A car overlaps a bicycle in the physical world, and their appearance features in the resulting image do not introduce ambiguity for 3D property learning.
  • Figure 4: Qualitative visualization of some samples on the KITTI val set. The red 3D boxes are produced by MonoLSS and the green boxes are the ground truth. Some unlabeled objects detected by MonoLSS are highlighted on the images. The last row shows the LSS sampling map of the corresponding object. Best viewed in color and zoomed in.
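
The spatial-overlap augmentation described in the Figure 3 caption can be sketched as a pixel-wise blend of two images whose 3D labels are both kept. This is a hedged illustration rather than the paper's code: the function name `mixup3d`, the blend weight `lam`, the 7-value box layout, and the requirement that both images share the same size (and, by assumption, the same camera calibration, so that the original 3D labels stay valid) are not taken from the text above.

```python
import numpy as np


def mixup3d(img_a, boxes_a, img_b, boxes_b, lam=0.5):
    """Blend two images and keep the 3D boxes of both.

    Assumes both images have the same shape and come from the same camera,
    so neither set of 3D labels has to be altered.
    """
    if img_a.shape != img_b.shape:
        raise ValueError("images must have the same shape to be blended")
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    boxes = np.concatenate([boxes_a, boxes_b], axis=0)  # labels are unchanged
    return mixed, boxes


# Usage with made-up data: two small images, boxes as rows of 7 numbers
# (a hypothetical layout such as x, y, z, h, w, l, yaw).
img_a = np.zeros((4, 6, 3), dtype=np.uint8)
img_b = np.full((4, 6, 3), 100, dtype=np.uint8)
boxes_a = np.zeros((2, 7), dtype=np.float32)
boxes_b = np.ones((1, 7), dtype=np.float32)
mixed, boxes = mixup3d(img_a, boxes_a, img_b, boxes_b)
```

Because the labels are concatenated rather than interpolated, each object keeps its own depth, dimension, and orientation targets, which matches the caption's point that overlapping appearances need not make the 3D targets ambiguous.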