Transfer Learning from Simulated to Real Scenes for Monocular 3D Object Detection
Sondos Mohamed, Walter Zimmer, Ross Greer, Ahmed Alaaeldin Ghita, Modesto Castrillón-Santana, Mohan Trivedi, Alois Knoll, Salvatore Mario Carta, Mirko Marras
TL;DR
This work tackles monocular 3D object detection in roadside scenes, where the domain gap between synthetic and real data hinders performance. It proposes a two-stage transfer learning pipeline that pretrains Cube R-CNN on the large synthetic RoadSense3D dataset and then fine-tunes it on real-world datasets (TUMTraf-A9 and DAIR-V2X-I), explicitly incorporating pitch and roll. In the experiments, single-step transfer raises $mAP_{3D}$ from $0.26$ to $12.76$ on TUMTraf-A9 and from $2.09$ to $6.60$ on DAIR-V2X-I, while multi-step transfer reaches a lower peak, possibly because of domain gaps. The results demonstrate the effectiveness of sim-to-real transfer for roadside monocular perception and suggest that direct fine-tuning on the target data gives the best performance, with implications for robust smart-city perception pipelines and broader autonomous-driving applications. The authors publicly release code, data, and qualitative results, and outline future directions: extending the approach to additional monocular methods, covering a broader range of orientations, and integrating it with active learning and anomaly detection.
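To put the reported gains on a common scale, the short Python sketch below computes the absolute and relative change in $mAP_{3D}$ from the figures quoted above. The function name `summarize_gain` and the dictionary layout are illustrative choices, not part of the paper's code release.

```python
# Before/after mAP_3D values as quoted in the TL;DR above.
REPORTED = {
    "TUMTraf-A9": (0.26, 12.76),
    "DAIR-V2X-I": (2.09, 6.60),
}


def summarize_gain(before: float, after: float) -> tuple:
    """Return (absolute gain, ratio after/before) for a pair of scores."""
    return after - before, after / before


if __name__ == "__main__":
    for name, (before, after) in REPORTED.items():
        gain, ratio = summarize_gain(before, after)
        print(f"{name}: +{gain:.2f} mAP_3D, {ratio:.1f}x the baseline")
```

On these figures the relative gain is far larger on TUMTraf-A9 (about 49x) than on DAIR-V2X-I (about 3.2x), largely because the TUMTraf-A9 starting score is close to zero.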
Abstract
Accurately detecting 3D objects from monocular images in dynamic roadside scenarios remains a challenging problem due to varying camera perspectives and unpredictable scene conditions. This paper introduces a two-stage training strategy to address these challenges. Our approach first trains a model on the large-scale synthetic dataset RoadSense3D, which offers a diverse range of scenarios for robust feature learning. We then fine-tune the model on a combination of real-world datasets to improve its adaptability to practical conditions. Experimental results for the Cube R-CNN model on challenging public benchmarks show a substantial improvement in detection performance when performing transfer learning: mean average precision rises from 0.26 to 12.76 on the TUM Traffic A9 Highway dataset and from 2.09 to 6.60 on the DAIR-V2X-I dataset. Code, data, and qualitative video results are available on the project website: https://roadsense3d.github.io.
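The two-stage strategy can be illustrated with a deliberately tiny stand-in: a one-parameter linear model trained by gradient descent, first on plentiful "synthetic" data and then on a few "real" samples whose true slope differs slightly. Everything here is an assumption made for illustration (the data, the slopes 2.0 and 2.2, the learning rate, the step counts, and the function names); it is not the Cube R-CNN training code and does not reproduce the paper's results.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, steps, lr=0.05):
    """Run plain gradient descent on the MSE, starting from weight w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Hypothetical domains: the "real" slope (2.2) is shifted from the synthetic one (2.0).
synthetic = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
real = [(x, 2.2 * x) for x in (1.0, 2.0)]

FINE_TUNE_STEPS = 5  # small budget, standing in for limited real data

# Baseline: train on the real samples only, from a zero initialisation.
w_scratch = train(0.0, real, FINE_TUNE_STEPS)

# Stage 1: pretrain on synthetic data. Stage 2: fine-tune on real data.
w_pretrained = train(0.0, synthetic, steps=50)
w_transfer = train(w_pretrained, real, FINE_TUNE_STEPS)

if __name__ == "__main__":
    print(f"real-only loss: {mse(w_scratch, real):.4f}")
    print(f"transfer loss:  {mse(w_transfer, real):.4f}")
```

In this toy setting the pretrained weight starts closer to the real-domain optimum, so the same five fine-tuning steps end at a lower loss than training on the real samples alone. The analogy is loose and says nothing about the size of the gain for an actual detector.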
