Transfer Learning from Simulated to Real Scenes for Monocular 3D Object Detection

Sondos Mohamed; Walter Zimmer; Ross Greer; Ahmed Alaaeldin Ghita; Modesto Castrillón-Santana; Mohan Trivedi; Alois Knoll; Salvatore Mario Carta; Mirko Marras

TL;DR

This work tackles monocular 3D object detection in roadside scenes, where the domain gap between synthetic and real data hinders performance. It proposes a two-stage transfer learning pipeline that pretrains Cube R-CNN on the large synthetic RoadSense3D dataset and then fine-tunes it on real-world datasets (TUMTraf-A9 and DAIR-V2X-I), explicitly incorporating pitch and roll. The experiments show large improvements in 3D mAP: single-step transfer raises $\mathrm{mAP}_{3D}$ from $0.26$ to $12.76$ on TUMTraf-A9 and from $2.09$ to $6.60$ on DAIR-V2X-I, while multi-step transfer reaches a lower peak, possibly because of domain gaps. The results demonstrate the effectiveness of sim-to-real transfer for roadside monocular perception and suggest that direct fine-tuning on target data offers the best performance, with implications for robust smart-city perception pipelines and broader autonomous-driving applications. The paper also publicly releases code, data, and qualitative results, and outlines future directions, including extension to additional monocular methods, broader orientation coverage, and integration with active learning and anomaly detection.
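To make the role of pitch and roll concrete, the following is a minimal sketch of how a full three-angle orientation changes the corners of a 3D box compared with a yaw-only box. The function names, the axis order (yaw about z, pitch about y, roll about x), and the dimension layout are assumptions chosen for illustration; they are not taken from the paper or from the Cube R-CNN code.

```python
import math


def rotation_matrix(yaw, pitch, roll):
    """Compose R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians.

    The axis order is an assumed convention, not the paper's.
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def box_corners(center, dims, yaw, pitch=0.0, roll=0.0):
    """Return the 8 corners of a box with the given center and orientation.

    dims is (length, width, height) along the box's local x, y, z axes
    (a hypothetical layout used only in this sketch).
    """
    R = rotation_matrix(yaw, pitch, roll)
    half = [d / 2.0 for d in dims]
    corners = []
    for sx in (-1, 1):
        for sy in (-1, 1):
            for sz in (-1, 1):
                local = (sx * half[0], sy * half[1], sz * half[2])
                # Rotate the local offset, then translate to the box center.
                corners.append(tuple(
                    center[i] + sum(R[i][j] * local[j] for j in range(3))
                    for i in range(3)
                ))
    return corners


if __name__ == "__main__":
    yaw_only = box_corners((0.0, 0.0, 0.0), (4.0, 2.0, 1.5), yaw=0.3)
    tilted = box_corners((0.0, 0.0, 0.0), (4.0, 2.0, 1.5), yaw=0.3, pitch=0.1, roll=-0.05)
    # With yaw only, every corner sits at z = +/- height / 2; pitch and roll break that.
    print(sorted({round(c[2], 3) for c in yaw_only}))
    print(sorted({round(c[2], 3) for c in tilted}))
```

With pitch and roll fixed at zero the box stays level with the reference plane, which is one reason a yaw-only parameterization can be too restrictive when the camera views the road from an elevated, tilted position.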

Abstract

Accurately detecting 3D objects from monocular images in dynamic roadside scenarios remains a challenging problem due to varying camera perspectives and unpredictable scene conditions. This paper introduces a two-stage training strategy to address these challenges. Our approach initially trains a model on the large-scale synthetic dataset, RoadSense3D, which offers a diverse range of scenarios for robust feature learning. Subsequently, we fine-tune the model on a combination of real-world datasets to enhance its adaptability to practical conditions. Experimental results of the Cube R-CNN model on challenging public benchmarks show a remarkable improvement in detection performance, with a mean average precision rising from 0.26 to 12.76 on the TUM Traffic A9 Highway dataset and from 2.09 to 6.60 on the DAIR-V2X-I dataset when performing transfer learning. Code, data, and qualitative video results are available on the project website: https://roadsense3d.github.io.
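As a quick sanity check on the reported numbers, the short sketch below computes the absolute and relative gains implied by the before/after values quoted in the abstract. Only the four mAP values come from the source; the dictionary layout and the helper name `gain` are illustrative choices.

```python
# mAP values (before, after transfer learning) as quoted in the abstract.
results = {
    "TUMTraf-A9": (0.26, 12.76),
    "DAIR-V2X-I": (2.09, 6.60),
}


def gain(before, after):
    """Return (absolute gain in mAP points, multiplicative factor)."""
    return after - before, after / before


for name, (before, after) in results.items():
    delta, ratio = gain(before, after)
    print(f"{name}: +{delta:.2f} mAP points, {ratio:.1f}x")
```

The gain on TUMTraf-A9 is 12.50 points (roughly a 49-fold increase), while the gain on DAIR-V2X-I is 4.51 points (roughly a 3-fold increase), so the relative benefit differs considerably between the two benchmarks.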

Paper Structure (17 sections, 2 equations, 2 figures, 5 tables)

Introduction
Related Work
Datasets for 3D Monocular Object Detection
Methods for Monocular 3D Object Detection
Methodology
Problem Definition
Model Selection
Initial Model Creation
Synthetic Dataset Selection.
Training from Scratch.
Pretrained Model Transfer
Real-World Datasets Selection.
Fine-tuning.
Experimental Results
Single-Step Dataset Transfer
...and 2 more sections

Figures (2)

Figure 1: Qualitative Results on the Synthetic RoadSense3D Test Set. We show 3D box detections of the Cube R-CNN model in the class-specific colors during different lighting and weather conditions.
Figure 2: Qualitative Results of Cube R-CNN on the TUMTraf-A9 Test Set. Comparison between the Cube R-CNN model trained from scratch on TUMTraf-A9 (top) and the model trained on RoadSense3D and fine-tuned on TUMTraf-A9 (bottom).
