WARM-3D: A Weakly-Supervised Sim2Real Domain Adaptation Framework for Roadside Monocular 3D Object Detection

Xingcheng Zhou, Deyu Fu, Walter Zimmer, Mingyu Liu, Venkatnarayanan Lakshminarasimhan, Leah Strand, Alois C. Knoll

TL;DR

This paper tackles the challenge of scarce real-world roadside 3D data by introducing the TUMTraf Synthetic Dataset and a weakly supervised Sim2Real framework, WARM-3D, that leverages off-the-shelf 2D detectors to provide pseudo labels for target-domain adaptation. The method combines Confidence-Aware Bipartite Matching, geometric constraints (2D-3D projective consistency and coplanarity), and an EMA-based teacher–student training regime to improve monocular 3D detection in the real domain. Empirical results show a substantial mAP_3D gain (+12.40 points) over a source-only baseline, with 2D ground-truth labels bringing performance close to Oracle; the approach also demonstrates robustness to unseen conditions. Overall, WARM-3D provides a practical path to deploy reliable roadside monocular 3D perception using synthetic data and readily available 2D annotations, with strong implications for scalable infrastructure sensing.

Abstract

Existing roadside perception systems are limited by the absence of publicly available, large-scale, high-quality 3D datasets. Exploring the use of cost-effective, extensive synthetic datasets offers a viable solution to tackle this challenge and enhance the performance of roadside monocular 3D detection. In this study, we introduce the TUMTraf Synthetic Dataset, offering a diverse and substantial collection of high-quality 3D data to augment scarce real-world datasets. In addition, we present WARM-3D, a concise yet effective framework to aid the Sim2Real domain transfer for roadside monocular 3D detection. Our method leverages cheap synthetic datasets and 2D labels from an off-the-shelf 2D detector for weak supervision. We show that WARM-3D significantly enhances performance, achieving a +12.40% increase in mAP 3D over the baseline with only pseudo-2D supervision. With 2D GT as weak labels, WARM-3D even reaches performance close to the Oracle baseline. Moreover, WARM-3D improves the ability of 3D detectors to recognize unseen samples across various real-world environments, highlighting its potential for practical applications.
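
The TL;DR mentions an EMA-based teacher-student training regime. As a rough illustration only, the sketch below shows the generic exponential-moving-average update that such regimes usually rely on: after each student step, the teacher weights are nudged toward the student weights. The function name `ema_update`, the dict-of-lists parameter layout, and the decay values are assumptions made for this example; the summary does not state the authors' actual settings.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` map parameter names to flat lists of floats.
    This layout and the default decay are illustrative assumptions,
    not values taken from the paper.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


# Toy weights: the teacher moves a small step toward the student.
teacher = {"w": [1.0, 2.0]}
student = {"w": [0.0, 4.0]}
new_teacher = ema_update(teacher, student, decay=0.9)
```

With `decay=0.9`, each teacher weight moves 10% of the way toward the corresponding student weight, so the teacher changes more smoothly than the student and can serve as a steadier source of targets on the real domain.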

Paper Structure (20 sections, 7 equations, 8 figures, 3 tables, 1 algorithm)

Figures (8)

  • Figure 1: Overview of our WARM-3D (Weakly-Supervised Domain Adaptation for Roadside Monocular 3D Object Detection) Framework. WARM-3D is a cross-domain transfer learning framework for roadside monocular 3D object detection leveraging mature off-the-shelf 2D object detectors to guide the weak-label supervision.
  • Figure 2: Ground truth visualization of TUMTraf-S Dataset. RGB image with 3D bounding box label (top) and instance-level segmentation label (bottom) for three samples: a) sunset, b) rain, and c) night.
  • Figure 3: Comparative analysis of the training set from TUMTraf-I Dataset (left) and the TUMTraf-S Dataset (right), highlighting the data correlations between these two datasets.
  • Figure 4: Illustration of the confidence-aware bipartite matching process. $\{ \mathcal{G}_{3D}, \mathcal{G}_{2D} \}$ denotes the matched 3D bounding box set and the kept 2D bounding box set. The lock and fire icons represent the frozen and unfrozen models, respectively.
  • Figure 5: Visualizations of WARM-3D performance on the TUMTraf-I Dataset test set. The WARM-3D method in f) performs qualitatively even better than the oracle method in a). The WARM-3D adaptation process is shown from 0 training steps in c) to 20k training steps in f). Over the course of adaptation, False Negative objects turn into True Positives, while False Positive objects become True Negatives.
  • ...and 3 more figures
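
The confidence-aware bipartite matching named in Figure 4 can be pictured with a small sketch. It assumes that each predicted 3D box has already been projected to an axis-aligned 2D box, that low-confidence 2D detections are discarded before matching, and that the matching score is the detector confidence times the IoU. These choices, the thresholds `min_conf` and `min_iou`, and the function names are illustrative assumptions rather than the authors' formulation. A brute-force search over assignments stands in for the Hungarian algorithm, which is only practical here because the toy sets are tiny.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match(projected, detections, min_conf=0.5, min_iou=0.3):
    """One-to-one matching of projected 3D boxes to 2D pseudo labels.

    projected:  2D boxes obtained by projecting predicted 3D boxes.
    detections: (box, confidence) pairs from a 2D detector.
    Returns (projected_index, detection_index) pairs.
    Thresholds and the scoring rule are hypothetical.
    """
    # Keep only 2D detections the detector is confident about.
    kept = [j for j, (_, conf) in enumerate(detections) if conf >= min_conf]
    preds = list(range(len(projected)))
    k = min(len(preds), len(kept))

    def score(i, j):
        box, conf = detections[j]
        return conf * iou(projected[i], box)

    # Enumerate every one-to-one assignment of the smaller set.
    if len(preds) <= len(kept):
        candidates = (list(zip(preds, p)) for p in permutations(kept, k))
    else:
        candidates = (list(zip(p, kept)) for p in permutations(preds, k))

    best, best_total = [], -1.0
    for pairs in candidates:
        total = sum(score(i, j) for i, j in pairs)
        if total > best_total:
            best, best_total = pairs, total

    # Drop matched pairs whose boxes barely overlap.
    return [(i, j) for i, j in best
            if iou(projected[i], detections[j][0]) >= min_iou]


projected = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [
    ((21, 21, 31, 31), 0.9),
    ((1, 1, 11, 11), 0.8),
    ((0, 0, 10, 10), 0.2),  # below min_conf, so it is ignored
]
pairs = match(projected, detections)
```

In this toy case the first projected box is paired with detection 1 and the second with detection 0, while the low-confidence detection is never considered even though it overlaps the first box perfectly. For realistic numbers of boxes, a polynomial-time assignment solver would replace the permutation search.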