Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection

Soyul Lee, Seungmin Baek, Dongbo Min

TL;DR

This work tackles the ill-posed nature of monocular 3D object detection by introducing MonoDLGD, a training-time denoising framework that perturbs ground-truth labels guided by instance-level uncertainty. It couples a 3D-DAB query mechanism with a difficulty-aware perturbation and reconstruction pipeline to impose explicit geometric supervision, improving geometry-aware representation learning without adding inference cost. Across KITTI benchmarks, MonoDLGD delivers state-of-the-art 3D and BEV detection performance across Easy, Moderate, and Hard levels, especially under occlusion, distance, and truncation challenges. The approach demonstrates that uncertainty-guided denoising and depth-aware geometric supervision can significantly enhance monocular 3D perception, with broad compatibility to DETR-based detectors and practical training-time gains.

Abstract

Monocular 3D object detection is a cost-effective solution for applications like autonomous driving and robotics, but remains fundamentally ill-posed due to inherently ambiguous depth cues. Recent DETR-based methods attempt to mitigate this through global attention and auxiliary depth prediction, yet they still struggle with inaccurate depth estimates. Moreover, these methods often overlook instance-level detection difficulty, such as occlusion, distance, and truncation, leading to suboptimal detection performance. We propose MonoDLGD, a novel Difficulty-Aware Label-Guided Denoising framework that adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty. Specifically, MonoDLGD applies stronger perturbations to easier instances and weaker ones to harder instances, and then reconstructs the perturbed labels to provide explicit geometric supervision. By jointly optimizing label reconstruction and 3D object detection, MonoDLGD encourages geometry-aware representation learning and improves robustness to varying levels of object complexity. Extensive experiments on the KITTI benchmark demonstrate that MonoDLGD achieves state-of-the-art performance across all difficulty levels.

Paper Structure

This paper contains 40 sections, 20 equations, 6 figures, 13 tables, 1 algorithm.

Figures (6)

  • Figure 1: Depth-Centric Detection: Elevating Depth and BEV Accuracy via MonoDLGD. (a) Depth estimation accuracy, measured as mean absolute error (MAE) on the KITTI validation set, for MonoDGP [monodgp], MonoDGP with our 3D-DAB, and ours. (b) Bird's-eye view (BEV) visualization. MonoDLGD achieves more accurate and robust detection across varying object distances, highlighting its improved geometric understanding.
  • Figure 2: Structural Comparison with MonoDGP [monodgp]. Our method introduces 3D-DAB queries to encode spatial priors and explicitly provides 3D geometric supervision by reconstructing perturbed label queries within a shared decoder.
  • Figure 3: Overview of the proposed MonoDLGD: (a) MonoDLGD adopts a two-stage architecture after extracting the encoder feature $f_{Enc}$ containing depth and visual features. Stage 1 (red arrows) performs Difficulty-Aware Perturbation (DAP) by first estimating the uncertainty of bounding box ($\sigma^{l}$, $\sigma^{t}$, $\sigma^{r}$, $\sigma^{b}$) and depth ($\sigma^d$) attributes and then adaptively perturbing the label queries based on the estimated uncertainties. Stage 2 (blue and green arrows) feeds both the perturbed label queries and the 3D-DAB queries into the decoder. (b) illustrates the internal components of label queries and 3D-DAB queries, all of which share a common structure.
  • Figure 4: Difficulty-Aware Perturbation (DAP) for projected bounding boxes. Objects with lower uncertainty (lower difficulty scores) receive larger perturbations (e.g., yellow box).
  • Figure 5: Distribution of predicted uncertainty across difficulty levels on the KITTI validation set. The x-axis represents the difficulty levels defined by the KITTI benchmark, where level 1 corresponds to Easy and level 3 corresponds to Hard. The y-axis shows the distributions and mean values of uncertainties predicted by the depth and projected bounding box detection heads.
  • ...and 1 more figures
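
The Difficulty-Aware Perturbation described in Figures 3 and 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the exponential mapping from uncertainty to noise magnitude, the `max_scale` value, the multiplicative uniform noise, and the function names `noise_scale` and `perturb_label` are all assumptions chosen for illustration. The only property taken from the text is that a lower predicted uncertainty ($\sigma^{l}$, $\sigma^{t}$, $\sigma^{r}$, $\sigma^{b}$, $\sigma^d$) leads to a larger perturbation of the ground-truth label.

```python
import math
import random


def noise_scale(sigma, max_scale=0.4):
    """Map a non-negative uncertainty to a relative noise magnitude.

    Assumption: exponential decay. The source only states that lower
    uncertainty should yield larger perturbations.
    """
    return max_scale * math.exp(-max(sigma, 0.0))


def perturb_label(sides, depth, side_sigmas, depth_sigma, rng):
    """Return a noisy copy of (sides, depth) for one ground-truth object.

    sides       -- ground-truth box values (l, t, r, b)
    side_sigmas -- predicted uncertainties, one per box value
    depth_sigma -- predicted depth uncertainty
    """
    # Hypothetical noise model: each value is rescaled by a random factor
    # whose range shrinks as the corresponding uncertainty grows.
    noisy_sides = [
        s * (1.0 + rng.uniform(-1.0, 1.0) * noise_scale(sg))
        for s, sg in zip(sides, side_sigmas)
    ]
    noisy_depth = depth * (1.0 + rng.uniform(-1.0, 1.0) * noise_scale(depth_sigma))
    return noisy_sides, noisy_depth


rng = random.Random(0)
box, depth = [20.0, 15.0, 22.0, 14.0], 30.0
easy = perturb_label(box, depth, [0.1] * 4, 0.1, rng)  # low uncertainty
hard = perturb_label(box, depth, [3.0] * 4, 3.0, rng)  # high uncertainty
```

Under these assumptions, the `easy` label may deviate by up to roughly 36% of each value, while the `hard` label stays within about 2%. During training, the perturbed labels would be fed to the decoder as label queries and reconstructed toward the clean ground truth.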