WARM-3D: A Weakly-Supervised Sim2Real Domain Adaptation Framework for Roadside Monocular 3D Object Detection
Xingcheng Zhou, Deyu Fu, Walter Zimmer, Mingyu Liu, Venkatnarayanan Lakshminarasimhan, Leah Strand, Alois C. Knoll
TL;DR
This paper tackles the scarcity of real-world roadside 3D data by introducing the TUMTraf Synthetic Dataset and WARM-3D, a weakly supervised Sim2Real framework that uses an off-the-shelf 2D detector to provide pseudo labels for target-domain adaptation. The method combines Confidence-Aware Bipartite Matching, geometric constraints (2D-3D projective consistency and coplanarity), and an EMA-based teacher–student training regime to improve monocular 3D detection in the real domain. Empirical results show a substantial mAP_3D gain (+12.40 points) over a source-only baseline using only pseudo-2D supervision; with 2D ground-truth labels as weak supervision, performance approaches the Oracle baseline. The approach also improves recognition of unseen samples across varied real-world environments. Overall, WARM-3D offers a practical path to deploying roadside monocular 3D perception using synthetic data and readily available 2D annotations, which matters for scalable infrastructure sensing.
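To make the EMA-based teacher–student idea concrete, here is a minimal sketch of an exponential-moving-average weight update. It is not taken from the paper: the function name `ema_update`, the dictionary representation of weights, and the decay value are illustrative assumptions, and the actual WARM-3D update schedule may differ.

```python
def ema_update(teacher, student, decay=0.999):
    """Return updated teacher weights as an exponential moving average.

    Assumption: weights are plain floats keyed by parameter name.
    In a real framework these would be tensors, one per parameter.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Hypothetical toy weights, for illustration only.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}

# The teacher drifts slowly toward the student after each training step.
teacher = ema_update(teacher, student, decay=0.9)
```

In this general scheme the student is trained by gradient descent while the teacher, which typically produces the pseudo labels, is only updated through the average, so it changes more smoothly than the student.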
Abstract
Existing roadside perception systems are limited by the absence of publicly available, large-scale, high-quality 3D datasets. Cost-effective, extensive synthetic datasets offer a viable way to address this limitation and improve roadside monocular 3D detection. In this study, we introduce the TUMTraf Synthetic Dataset, a diverse and substantial collection of high-quality 3D data that augments scarce real-world datasets. In addition, we present WARM-3D, a concise yet effective framework for Sim2Real domain transfer in roadside monocular 3D detection. Our method uses cheap synthetic datasets together with 2D labels from an off-the-shelf 2D detector as weak supervision. We show that WARM-3D significantly enhances performance, achieving a +12.40% gain in mAP_3D over the baseline with only pseudo-2D supervision. With 2D ground truth (GT) as weak labels, WARM-3D reaches performance close to the Oracle baseline. Moreover, WARM-3D improves the ability of 3D detectors to recognize unseen samples across various real-world environments, highlighting its potential for practical applications.
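One common way for 2D labels to weakly supervise a 3D detector is a projective consistency check: project the predicted 3D box into the image and compare its enclosing rectangle with the 2D label. The sketch below illustrates that general idea only and is not the paper's loss. It assumes a pinhole camera, boxes given in camera coordinates with z pointing forward, axis-aligned boxes (yaw omitted), and all corners in front of the camera; every function name and number is hypothetical.

```python
from itertools import product


def box_corners(center, dims):
    """Eight corners of an axis-aligned 3D box in camera coordinates."""
    cx, cy, cz = center
    dx, dy, dz = dims
    return [
        (cx + sx * dx / 2.0, cy + sy * dy / 2.0, cz + sz * dz / 2.0)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]


def project_to_2d_box(corners, fx, fy, u0, v0):
    """Pinhole projection of the corners, then the enclosing 2D rectangle."""
    us = [fx * x / z + u0 for x, _, z in corners]
    vs = [fy * y / z + v0 for _, y, z in corners]
    return (min(us), min(vs), max(us), max(vs))


def projective_consistency_l1(center, dims, intrinsics, box_2d):
    """Mean absolute difference between the projected box and a 2D label."""
    pred = project_to_2d_box(box_corners(center, dims), *intrinsics)
    return sum(abs(p - t) for p, t in zip(pred, box_2d)) / 4.0


# Hypothetical intrinsics (fx, fy, u0, v0) and a box 10 m in front of the camera.
intrinsics = (1000.0, 1000.0, 500.0, 500.0)
label = project_to_2d_box(box_corners((0.0, 0.0, 10.0), (2.0, 2.0, 2.0)), *intrinsics)

# A laterally shifted 3D prediction no longer matches the 2D label.
loss = projective_consistency_l1((0.5, 0.0, 10.0), (2.0, 2.0, 2.0), intrinsics, label)
```

A penalty of this kind needs no 3D annotation in the target domain, which is why 2D pseudo labels or 2D ground truth can serve as the supervision signal.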
