LeAD-M3D: Leveraging Asymmetric Distillation for Real-time Monocular 3D Detection

Johannes Meier, Jonathan Michel, Oussema Dhaouadi, Yung-Hsu Yang, Christoph Reich, Zuria Bauer, Stefan Roth, Marc Pollefeys, Jacques Kaiser, Daniel Cremers

TL;DR

LeAD-M3D tackles real-time monocular 3D detection without LiDAR by combining three components: asymmetric augmentation denoising distillation (A2D2), which transfers depth cues from a clean-image teacher to a mixup-augmented student; 3D-aware consistent matching (CM3D), which fuses 2D and 3D overlaps for robust prediction-to-ground-truth assignment; and confidence-gated 3D inference (CGI3D), which restricts expensive 3D regression to high-confidence regions. Built on a YOLOv10-M3D backbone, LeAD-M3D delivers state-of-the-art accuracy on KITTI, Waymo, and Rope3D while achieving real-time inference, outperforming methods that rely on LiDAR, stereo, or geometric priors. Ablation studies show A2D2 as the primary accuracy driver, with CM3D and CGI3D contributing substantial efficiency and stability benefits. The approach establishes a new Pareto frontier for monocular 3D detection, demonstrating that high fidelity and fast inference can be achieved together in a LiDAR-free setting. Potential future directions include leveraging unlabeled data for further distillation, domain-agnostic transfer, and temporal or multi-view extensions.
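
To make the idea of fusing 2D and 3D overlaps into a single assignment score more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the function `fused_matching_score`, the mixing weight `w3d`, and the exponents `alpha` and `beta` are hypothetical, and the 3D overlap is passed in as a precomputed number in [0, 1] rather than computed as MGIoU.

```python
def iou_2d(a, b):
    """Axis-aligned IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fused_matching_score(cls_score, overlap_2d, overlap_3d,
                         w3d=0.5, alpha=1.0, beta=1.0):
    """Hypothetical matching score that mixes 2D and 3D overlap.

    Assumption: overlaps lie in [0, 1]; w3d, alpha, and beta are
    illustrative hyperparameters, not values from the paper.
    """
    overlap = (1.0 - w3d) * overlap_2d + w3d * overlap_3d
    return (cls_score ** alpha) * (overlap ** beta)


# Two predictions that look identical in 2D but differ in 3D overlap.
gt_box = (0.0, 0.0, 2.0, 2.0)
pred_box = (1.0, 1.0, 3.0, 3.0)
shared_2d = iou_2d(gt_box, pred_box)
score_far = fused_matching_score(0.8, shared_2d, overlap_3d=0.2)
score_near = fused_matching_score(0.8, shared_2d, overlap_3d=0.7)
print(score_near > score_far)
```

With a 2D-only score the two candidates would tie; adding a 3D term is one way such ties in crowded scenes could be broken in favor of the geometrically better prediction.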

Abstract

Real-time monocular 3D object detection remains challenging due to severe depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing approaches either rely on LiDAR or geometric priors to compensate for missing depth, or sacrifice efficiency to achieve competitive accuracy. We introduce LeAD-M3D, a monocular 3D detector that achieves state-of-the-art accuracy and real-time inference without extra modalities. Our method is powered by three key components. Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from a clean-image teacher to a mixup-noised student via a quality- and importance-weighted depth-feature loss, enabling stronger depth reasoning without LiDAR supervision. 3D-aware Consistent Matching (CM3D) improves prediction-to-ground-truth assignment by integrating 3D MGIoU into the matching score, yielding more stable and precise supervision. Finally, Confidence-Gated 3D Inference (CGI3D) accelerates detection by restricting expensive 3D regression to top-confidence regions. Together, these components set a new Pareto frontier for monocular 3D detection: LeAD-M3D achieves state-of-the-art accuracy on KITTI and Waymo, and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods. Our results demonstrate that high fidelity and real-time efficiency in monocular 3D detection are simultaneously attainable without LiDAR, stereo, or geometric assumptions.
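
The asymmetric setup described for A2D2 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `mixup` uses the standard convex blend of two inputs, and `weighted_feature_loss` is a hypothetical per-instance weighted mean squared error in which the `weights` stand in for the quality and importance terms, whose exact definitions are not given in the abstract.

```python
def mixup(img_a, img_b, lam):
    """Standard mixup: convex blend of two flattened images, 0 <= lam <= 1."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(img_a, img_b)]


def weighted_feature_loss(student_feats, teacher_feats, weights):
    """Hypothetical weighted MSE between matched per-instance features.

    Assumption: each weight is a combined quality/importance score for
    one instance; the paper's actual weighting is not reproduced here.
    """
    total_w = sum(weights)
    if total_w == 0:
        return 0.0
    loss = 0.0
    for s, t, w in zip(student_feats, teacher_feats, weights):
        mse = sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(s)
        loss += w * mse
    return loss / total_w


# The student input is a blend of two images; the teacher would see clean ones.
student_input = mixup([0.0, 1.0], [1.0, 0.0], lam=0.5)

# Toy features for two matched instances (teacher from clean input,
# student from the mixed input); the second instance is weighted higher.
teacher = [[0.0, 0.0], [0.0, 0.0]]
student = [[1.0, 1.0], [0.0, 0.0]]
loss = weighted_feature_loss(student, teacher, weights=[1.0, 3.0])
print(student_input, loss)
```

Normalizing by the sum of weights keeps the loss scale comparable across images with different numbers of instances, which is a design choice of this sketch rather than something stated in the source.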

Paper Structure

This paper contains 39 sections, 6 equations, 7 figures, 21 tables.

Figures (7)

  • Figure 1: Runtime vs. accuracy on the KITTI test set, using $\text{AP}_{\text{3D}|\text{R40}}^{0.7}$ Mod (in %, $\uparrow$) and runtime (in ms, $\downarrow$). We provide different model variants (N to X) to balance runtime and accuracy. LeAD-M3D offers a Pareto frontier over existing approaches. Our most accurate model outperforms the most accurate recent approach, MonoDiff, while being $3.6\times$ faster. Using TensorRT further improves runtime, enabling real-time inference of even our largest model variant (X). For a fair comparison, all reported methods are re-evaluated on the same hardware (NVIDIA RTX 8000) if the code is publicly available.
  • Figure 2: Overview of LeAD-M3D. (a) We distill high-dimensional instance-depth features from a large teacher to a compact student. To create an information gap, the teacher sees clean images. The student receives a mixup image and must reproduce the teacher's intermediate features (see the distillation method section). This frames distillation as a denoising task, which removes mixup-induced artifacts. CM3D uses ground truth to pair corresponding teacher and student predictions (cf. the CM3D section). (b) The model architecture in detail. Dim. stands for dimension head, Orient. stands for orientation head, and Uncert. stands for depth uncertainty head.
  • Figure 3: CM3D. Disambiguating prediction-to-ground-truth assignments in crowded 3D scenes by integrating 2D and 3D overlaps (cf. the CM3D section).
  • Figure 4: CGI3D. FLOP reduction and speedup by restricting 2D/3D regression to high-confidence locations (cf. the speedup section).
  • Figure 5: Qualitative results on the KITTI validation set. LeAD-M3D X achieves more accurate depth estimates than YOLOv10-M3D X. Best viewed in color and with zoom. BEV color coding: Ground truth, YOLOv10-M3D X, LeAD-M3D X, and field of view.
  • ...and 2 more figures
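
The gating idea in the Figure 4 caption, running the costly regression only at high-confidence locations, can be illustrated with a small plain-Python sketch. The function `confidence_gated_inference`, the top-k selection rule, and the stand-in `head_3d` callable are assumptions made for illustration; the paper's actual gating criterion and head design are not specified in the captions above.

```python
def confidence_gated_inference(scores, features, head_3d, k):
    """Run head_3d only at the k highest-scoring locations.

    Assumption: one confidence score and one feature vector per
    location; head_3d is a placeholder for an expensive regression head.
    Returns a dict mapping location index to the head's output.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {i: head_3d(features[i]) for i in ranked[:k]}


calls = []


def toy_head(feature):
    """Stand-in for a 3D regression head; records how often it runs."""
    calls.append(feature)
    return sum(feature)


scores = [0.1, 0.9, 0.3, 0.8]
features = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
outputs = confidence_gated_inference(scores, features, toy_head, k=2)
print(sorted(outputs), len(calls))
```

In this toy run the head is evaluated at two of four locations, so its cost is halved; the saving in a real detector depends on how many locations survive the gate.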