IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection

Johannes Meier, Florian Günther, Riccardo Marin, Oussema Dhaouadi, Jacques Kaiser, Daniel Cremers

TL;DR

IDEAL-M3D introduces an instance-based active learning framework for monocular 3D object detection, addressing inefficiencies of image-based labeling and biases of uncertainty-based sample selection. The method builds a heterogeneous, fast-to-train ensemble and leverages Core-Set-inspired instance selection, augmented with task-agnostic visual features, to maximize information gain under labeling budgets. A novel NAURC metric enables budget-aware, cross-method comparisons between image- and instance-based AL. Across KITTI, Waymo, and Rope3D, IDEAL-M3D achieves state-of-the-art label efficiency, matching or surpassing full-data performance with only a fraction of labels, and demonstrates strong cross-dataset robustness with practical training-time costs.

Abstract

Monocular 3D detection relies on just a single camera and is therefore easy to deploy. Yet, achieving reliable 3D understanding from monocular images requires substantial annotation, and 3D labels are especially costly. To maximize performance under constrained labeling budgets, it is essential to prioritize annotating samples expected to deliver the largest performance gains. This prioritization is the focus of active learning. Curiously, we observed two significant limitations in active learning algorithms for monocular 3D object detection. First, previous approaches select entire images, which is inefficient, as non-informative instances contained in the same image also need to be labeled. Second, existing methods rely on uncertainty-based selection, which in monocular 3D object detection creates a bias toward depth ambiguity. Consequently, distant objects are selected, while nearby objects are overlooked. To address these limitations, we propose IDEAL-M3D, the first instance-level pipeline for monocular 3D detection. For the first time, we demonstrate that an explicitly diverse, fast-to-train ensemble improves diversity-driven active learning for monocular 3D detection. We induce diversity with heterogeneous backbones and task-agnostic features, loss weight perturbation, and time-dependent bagging. IDEAL-M3D shows superior performance and significant resource savings: with just 60% of the annotations, we achieve similar or better AP3D on the KITTI validation and test sets than training the same detector on the whole dataset.

Paper Structure

This paper contains 44 sections, 18 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: IDEAL-M3D is the first instance-based active learning method for monocular 3D detection. Left: While previous active learning approaches select entire images for labeling, we identify the most informative object instances (difference is highlighted in rounded boxes). Right: Our approach achieves fully supervised performance using only 50-60% of the labeled boxes, significantly outperforming existing active learning methods across the majority of object categories (KITTI validation set, $AP_{3D|R_{40}}^{0.7}$ Moderate).
  • Figure 2: Overview of IDEAL-M3D. Left: Our instance-based AL pipeline couples precise instance targeting with time-adaptive sampling, minimizing expert effort while remaining training-time efficient. Right: We maximize feature-space coverage by fusing Core-Set selection with an explicitly diverse, fast-to-train ensemble and task-agnostic visual embeddings, yielding robust geometry-aware selection under modest compute. IDEAL-M3D uniquely integrates diversity-based selection with an ensemble purpose-built for representational diversity in M3D, delivering label efficiency without the cost of conventional ensembles.
  • Figure 3: AL training curves. Plot 1: We report the KITTI validation performance for cars ($AP_{3D|R_{40}}^{0.7}$, Moderate). Plot 2: We report the Waymo validation performance ($AP^{0.5}$) for vehicles. Plots 3-4: We report the Rope3D validation performance ($AP_{3D|R_{40}}^{0.5}$) for Cars and Big Vehicles. All results show mean performance across three rounds with identical initialization.
  • Figure 4: Qualitative results of IDEAL-M3D on the KITTI (first and second rows) and Rope3D (third row) datasets. The results demonstrate prediction evolution and label selection strategy across time steps. Color coding: green boxes represent ground truth annotations, pink boxes indicate predictions on previously labeled objects, cyan boxes highlight predictions selected for the next labeling round, and orange boxes show predictions that remain unlabeled (best viewed in color with zoom).
  • Figure 5: Ratio of labeled instances vs. instance-based budget on KITTI. Most image-based methods request images with more than the average number of objects.
  • ...and 12 more figures
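
The TL;DR and the Figure 2 caption describe Core-Set-inspired selection over object instances. As a rough illustration of the general idea only, the sketch below implements plain greedy k-center selection over per-instance feature vectors. It is not the authors' implementation: the function name `k_center_greedy`, the Euclidean distance, and the toy features are assumptions made for this example, and the ensemble, task-agnostic embeddings, and time-dependent sampling described in the paper are not modeled.

```python
import numpy as np


def k_center_greedy(features, labeled_idx, budget):
    """Greedy k-center selection (hypothetical helper, not the paper's code).

    features:    (n, d) array, one feature vector per object instance.
    labeled_idx: indices of instances that are already labeled.
    budget:      number of additional instances to request labels for.
    Returns the chosen instance indices in selection order.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    labeled = sorted(set(labeled_idx))
    budget = max(0, min(budget, n - len(labeled)))

    # Distance from every instance to its nearest labeled instance.
    min_dist = np.full(n, np.inf)
    for i in labeled:
        dist = np.linalg.norm(features - features[i], axis=1)
        min_dist = np.minimum(min_dist, dist)
    # Already-labeled instances must never be selected again.
    min_dist[labeled] = -np.inf

    selected = []
    for _ in range(budget):
        # Pick the instance that is farthest from everything covered so far.
        pick = int(np.argmax(min_dist))
        selected.append(pick)
        dist = np.linalg.norm(features - features[pick], axis=1)
        min_dist = np.minimum(min_dist, dist)
        min_dist[pick] = -np.inf
    return selected


# Toy usage: five instances in a 2-D feature space, instance 0 already labeled.
toy_features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 0.0], [0.0, 9.0]]
chosen = k_center_greedy(toy_features, labeled_idx=[0], budget=2)
print(chosen)
```

In this toy case the near-duplicate of the labeled instance (index 1) is skipped in favor of instances that lie far from anything already covered, which is the coverage behavior that diversity-driven selection aims for. In a setting like the one the paper describes, the rows of `features` would presumably be per-instance embeddings rather than hand-written points.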