MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label

Junyoung Jung, Seokwon Kim, Jun Uk Kim

Abstract

Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are feature-consistent with the learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/VisualAIKHU/MonoSAOD.
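
The following is a minimal sketch of the Prototype-Based Filtering idea described in the abstract, not the authors' implementation: per-class 2D RoI feature prototypes are kept as exponential moving averages, and a teacher prediction is accepted as a pseudo-label only if its RoI feature is cosine-similar to its class prototype and its predicted depth uncertainty is small. The function names, thresholds, and EMA momentum are illustrative assumptions.

    import numpy as np

    def select_pseudo_labels(roi_feats, class_ids, depth_sigmas, prototypes,
                             sim_thresh=0.8, sigma_thresh=0.3):
        # Keep indices of predictions whose RoI feature matches the class
        # prototype (cosine similarity) and whose depth uncertainty is low.
        # Thresholds are placeholders, not values from the paper.
        keep = []
        for i, (feat, cls, sigma) in enumerate(zip(roi_feats, class_ids, depth_sigmas)):
            proto = prototypes[cls]
            sim = feat @ proto / (np.linalg.norm(feat) * np.linalg.norm(proto) + 1e-8)
            if sim >= sim_thresh and sigma <= sigma_thresh:
                keep.append(i)
        return keep

    def update_prototypes(prototypes, roi_feats, class_ids, momentum=0.99):
        # EMA update of the global per-class prototypes from accepted RoI features.
        for feat, cls in zip(roi_feats, class_ids):
            prototypes[cls] = momentum * prototypes[cls] + (1.0 - momentum) * feat
        return prototypes

In a teacher-student setup, the accepted pseudo-labels would be added to the sparse ground truths used to supervise the student, in the spirit of the GT Bank enrichment shown in Figure 3.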

Paper Structure

This paper contains 17 sections, 10 equations, 11 figures, 11 tables, and 1 algorithm.

Figures (11)

  • Figure 1: (a) Visible objects are annotated in one scene but missed in another due to the difficulty of 3D bounding box annotation and human error, resulting in inconsistent supervision. (b) Comparison between fully annotated and sparsely annotated labels. In the sparse annotation setting, only a subset of objects is labeled while many valid objects remain unlabeled.
  • Figure 2: Overall architecture of the proposed framework. A teacher–student structure integrates Road-Aware Patch Augmentation (RAPA) and Prototype-Based Filtering (PBF) for robust training under sparse annotations.
  • Figure 3: Progression of pseudo-labels selected by the proposed PBF module for GT Bank enrichment. Green boxes denote sparse ground truths and previously accumulated pseudo-labels, while red boxes indicate high-quality pseudo-labels newly selected at the current step. The consistent selection of geometrically and semantically reliable pseudo-labels highlights the effectiveness of the PBF module.
  • Figure 4: Line plot of GT annotation growth, where 30% (blue), 50% (green), and 70% (red) represent different sparsity levels.
  • Figure 5: Examples of the RAPA module. Objects segmented from a source image are geometrically transformed and placed onto plausible road regions in the target image. This produces visually realistic and geometrically consistent augmentations (a simplified sketch of the scaling idea follows this list).
  • ...and 6 more figures
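
As a rough illustration of the geometric consistency that the RAPA module (Figure 5) aims for, the sketch below rescales a segmented object patch according to the depth of its new road location and projects that location into the image with the camera intrinsics. The function name, inputs, and the purely depth-ratio scaling rule are simplifying assumptions, not the paper's exact procedure.

    import numpy as np

    def place_patch_on_road(patch, src_depth, road_point_cam, K):
        # Rescale a segmented object patch so its apparent size is consistent
        # with the depth of the target road point, and return the paste center.
        # road_point_cam: 3D point on the road surface in camera coordinates.
        # K: 3x3 camera intrinsic matrix. All names are illustrative.
        dst_depth = road_point_cam[2]
        scale = src_depth / dst_depth              # farther away -> smaller patch
        h, w = patch.shape[:2]
        new_size = (max(1, int(round(w * scale))), max(1, int(round(h * scale))))
        uv = K @ road_point_cam                    # project road point to pixels
        center = (int(round(uv[0] / uv[2])), int(round(uv[1] / uv[2])))
        return new_size, center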