Cityscapes 3D: Dataset and Benchmark for 9 DoF Vehicle Detection
Nils Gählert, Nicolas Jourdan, Marius Cordts, Uwe Franke, Joachim Denzler
TL;DR
This paper introduces Cityscapes 3D, a dataset and benchmark for monocular 3D vehicle detection that labels 3D bounding boxes using stereo RGB imagery rather than lidar, capturing nine degrees of freedom and ensuring pixel-accurate reprojection in RGB. It presents a stereo-based labeling workflow, a detailed dataset analysis, and a novel benchmark with depth-aware metrics that operate within distance bins up to 100 m, along with a two-factor score that combines 2D AP and 3D geometry accuracy. Key contributions include high-quality 3D annotations paired with 2D instance masks, the avoidance of cross-sensor calibration errors, and the introduction of DDTP metrics (BEVCD, YawSim, PRSim, SizeSim) plus a composite Detection Score for robust monocular evaluation. The work aims to advance monocular 3D detection research by enabling fairer comparisons and more practical deployment considerations for autonomous driving.
Abstract
Detecting vehicles and representing their position and orientation in three-dimensional space is a key technology for autonomous driving. Recently, methods for 3D vehicle detection based solely on monocular RGB images have gained popularity. To facilitate this task and to compare and drive state-of-the-art methods, several new datasets and benchmarks have been published. Ground truth annotations of vehicles are usually obtained using lidar point clouds, which often induces errors due to imperfect calibration or synchronization between the lidar and the camera. To address this, we propose Cityscapes 3D, extending the original Cityscapes dataset with 3D bounding box annotations for all types of vehicles. In contrast to existing datasets, our 3D annotations were labeled using stereo RGB images only and capture all nine degrees of freedom. This leads to a pixel-accurate reprojection in the RGB image and a higher range of annotations compared to lidar-based approaches. To ease multitask learning, we provide a pairing of 2D instance segments with 3D bounding boxes. In addition, we complement the Cityscapes benchmark suite with 3D vehicle detection based on the new annotations as well as metrics presented in this work. Dataset and benchmark are available online.
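To make the two-factor Detection Score from the TL;DR concrete, the following minimal Python sketch combines a 2D AP value with the four DDTP scores. It rests on assumptions not stated in this section: every input lies in [0, 1], the four geometry scores are averaged with equal weights, and the result is multiplied by AP. The names `DDTPScores` and `detection_score` are hypothetical and are not taken from the official benchmark code.

```python
from dataclasses import dataclass


@dataclass
class DDTPScores:
    """Hypothetical container for the four DDTP scores, each assumed to lie in [0, 1]."""
    bev_cd: float    # BEVCD: bird's-eye-view center distance score
    yaw_sim: float   # YawSim: yaw similarity
    pr_sim: float    # PRSim: pitch-roll similarity
    size_sim: float  # SizeSim: size similarity


def detection_score(ap_2d: float, scores: DDTPScores) -> float:
    """Combine 2D AP with 3D geometry accuracy into one score.

    Assumption: the geometry factor is the unweighted mean of the four
    DDTP scores and is multiplied by the 2D AP.
    """
    values = (ap_2d, scores.bev_cd, scores.yaw_sim, scores.pr_sim, scores.size_sim)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("all inputs are expected to lie in [0, 1]")
    geometry = (scores.bev_cd + scores.yaw_sim + scores.pr_sim + scores.size_sim) / 4.0
    return ap_2d * geometry
```

Under these assumptions, a detector with perfect 3D geometry scores exactly its 2D AP, while poor 3D estimates scale the score down proportionally, so neither strong 2D detection nor strong 3D geometry alone yields a high result.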
