Cityscapes 3D: Dataset and Benchmark for 9 DoF Vehicle Detection

Nils Gählert, Nicolas Jourdan, Marius Cordts, Uwe Franke, Joachim Denzler

TL;DR

This paper introduces Cityscapes 3D, a dataset and benchmark for monocular 3D vehicle detection that labels 3D bounding boxes using stereo RGB imagery rather than lidar, capturing nine degrees of freedom and ensuring pixel-accurate reprojection in RGB. It presents a stereo-based labeling workflow, a detailed dataset analysis, and a novel benchmark with depth-aware metrics that operate within distance bins up to 100 m, along with a two-factor score that combines 2D AP and 3D geometry accuracy. Key contributions include high-quality 3D annotations paired with 2D instance masks, the avoidance of cross-sensor calibration errors, and the introduction of DDTP metrics (BEVCD, YawSim, PRSim, SizeSim) plus a composite Detection Score for robust monocular evaluation. The work aims to advance monocular 3D detection research, enabling fairer comparisons and more practical deployment considerations for autonomous driving.
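
The summary above does not spell out how the two-factor score is formed, so the following is only a minimal sketch. It assumes the Detection Score is the 2D AP multiplied by the unweighted mean of the four DDTP metrics, each normalized to [0, 1], and that detections are grouped into fixed-width distance bins. The function names, the 5 m bin width, and the equal weighting are illustrative assumptions, not taken from the text above.

```python
def depth_bin_index(distance_m, bin_width_m=5.0, max_range_m=100.0):
    """Return the index of the distance bin for an object, or None if it
    lies outside [0, max_range_m). The bin width is an assumed parameter."""
    if distance_m < 0 or distance_m >= max_range_m:
        return None
    return int(distance_m // bin_width_m)


def detection_score(ap_2d, bevcd, yaw_sim, pr_sim, size_sim):
    """Assumed two-factor score: 2D AP times the mean of the four
    3D geometry terms (all inputs expected in [0, 1])."""
    geometry = (bevcd + yaw_sim + pr_sim + size_sim) / 4.0
    return ap_2d * geometry


if __name__ == "__main__":
    # A detector with good 2D AP but mediocre 3D geometry is penalized.
    print(detection_score(0.8, 0.5, 0.5, 0.5, 0.5))
    print(depth_bin_index(12.3))
```

Under this assumed form, a method cannot reach a high score through 2D detection quality alone, since the geometry factor scales the AP directly.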

Abstract

Detecting vehicles and representing their position and orientation in three-dimensional space is a key technology for autonomous driving. Recently, methods for 3D vehicle detection based solely on monocular RGB images have gained popularity. In order to facilitate this task as well as to compare and drive state-of-the-art methods, several new datasets and benchmarks have been published. Ground truth annotations of vehicles are usually obtained using lidar point clouds, which often induces errors due to imperfect calibration or synchronization between the lidar and the camera. To this end, we propose Cityscapes 3D, extending the original Cityscapes dataset with 3D bounding box annotations for all types of vehicles. In contrast to existing datasets, our 3D annotations were labeled using stereo RGB images only and capture all nine degrees of freedom. This leads to a pixel-accurate reprojection in the RGB image and a higher range of annotations compared to lidar-based approaches. In order to ease multitask learning, we provide a pairing of 2D instance segments with 3D bounding boxes. In addition, we complement the Cityscapes benchmark suite with 3D vehicle detection based on the new annotations as well as metrics presented in this work. Dataset and benchmark are available online.
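
To make "nine degrees of freedom" concrete, the sketch below builds the eight corners of a box from three center coordinates, three dimensions, and three rotation angles. The Euler order (yaw about z, then pitch about y, then roll about x), the axis layout, and all names are assumptions chosen for illustration. The dataset's own files may encode rotation differently, for example as a quaternion.

```python
import math
from itertools import product


def rotation_matrix(yaw, pitch, roll):
    """R = Rz(yaw) * Ry(pitch) * Rx(roll), angles in radians.
    This Euler convention is an assumption for the example."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def box_corners(center, size, yaw, pitch, roll):
    """Eight corners of a 9 DoF box.

    center: (x, y, z) of the box center
    size:   (length, width, height) along the box's local x, y, z axes
    """
    rot = rotation_matrix(yaw, pitch, roll)
    corners = []
    for signs in product((-0.5, 0.5), repeat=3):
        # Corner in the box's local frame, before rotation and translation.
        local = [signs[k] * size[k] for k in range(3)]
        corners.append(tuple(
            center[i] + sum(rot[i][j] * local[j] for j in range(3))
            for i in range(3)
        ))
    return corners


if __name__ == "__main__":
    # A car 10 m ahead on a slope: non-zero pitch tilts the whole box.
    for corner in box_corners((10.0, 0.0, 0.0), (4.0, 2.0, 1.5),
                              yaw=0.0, pitch=math.radians(5.0), roll=0.0):
        print(corner)
```

A box restricted to yaw only would keep its top and bottom faces parallel to the ground plane, which is why pitch and roll matter for an accurate reprojection on sloped streets.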

Paper Structure

This paper contains 14 sections, 11 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Example image with 3D bounding boxes for vehicles. The box annotations feature a full 3D orientation including yaw, pitch and roll labels.
  • Figure 2: Examples of annotation artifacts due to lidar synchronization errors (top: Lyft [lyft2019], bottom: Argoverse).
  • Figure 3: Top: Imprecise 3D box reprojections in nuScenes [nuscenes2019] due to missing pitch and roll annotations. Bottom: Example of a car crossing from a steep street in Cityscapes 3D with pitch $> 0$.
  • Figure 4: Workflow of creating the 3D bounding box annotations given the existing instance polygons and stereo measurements.
  • Figure 5: Example of the bird's-eye view labeling aid. 3D bounding box annotations and the stereo point cloud, filtered for vehicles and markings, are shown from a top-view perspective of the scene.
  • ...and 5 more figures