The Cityscapes Dataset for Semantic Urban Scene Understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele

TL;DR

Cityscapes addresses the need for large-scale, diverse urban-scene data by introducing a benchmark with dense pixel-level and instance-level annotations across 50 cities, complemented by stereo depth information. The authors quantify dataset characteristics, provide robust evaluation metrics (IoU, iIoU, AP), and conduct extensive baselines and cross-dataset analyses to reveal how urban scenes differ from generic datasets. Key contributions include the largest richly annotated urban dataset to date, a rigorous evaluation framework for pixel- and instance-level tasks, and insights into how coarse labels, downsampling, and proposal quality impact performance. The work underscores the importance of high-resolution, variable-condition data for advancing semantic and instance-level understanding in real-world driving scenarios, and it sets the stage for future dataset expansions and method development.
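As a rough illustration of the pixel-level IoU metric named above, the sketch below computes per-class intersection-over-union as TP / (TP + FP + FN) from a confusion matrix. It is not the benchmark's official evaluation code. The function name `class_iou`, the `ignore_label` value of 255, and the toy label maps are assumptions made for this example, and predictions are assumed to lie in `[0, num_classes)`.

```python
import numpy as np

def class_iou(pred, gt, num_classes, ignore_label=255):
    """Per-class IoU = TP / (TP + FP + FN), skipping ignored ground-truth pixels.

    ignore_label is an assumed convention for unlabeled pixels.
    """
    pred = np.asarray(pred, dtype=np.int64).ravel()
    gt = np.asarray(gt, dtype=np.int64).ravel()

    # Drop pixels whose ground truth is marked as ignored.
    valid = gt != ignore_label
    pred, gt = pred[valid], gt[valid]

    # confusion[i, j] counts pixels with ground truth i that were predicted as j.
    confusion = np.bincount(
        gt * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp  # predicted as the class, but wrong
    fn = confusion.sum(axis=1) - tp  # belongs to the class, but missed
    denom = tp + fp + fn

    # Classes absent from both prediction and ground truth get NaN.
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

# Toy 2x3 label maps with three classes; one pixel is ignored.
gt = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 2]])

ious = class_iou(pred, gt, num_classes=3)
mean_iou = np.nanmean(ious)
```

Averaging the per-class scores with `np.nanmean` gives a mean IoU that skips classes absent from the evaluated images. The instance-weighted iIoU and the instance-level AP are not covered by this sketch.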

Abstract

Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.

Paper Structure

This paper contains 23 sections, 12 figures, 21 tables.

Figures (12)

  • Figure 1: Number of finely annotated pixels (y-axis) per class and their associated categories (x-axis).
  • Figure 2: Proportion of annotated pixels (y-axis) per category (x-axis) for Cityscapes, CamVid Brostow2009, DUS Scharwachter2013, and KITTI Geiger2013a.
  • Figure 3: Dataset statistics regarding scene complexity. Only MS COCO and Cityscapes provide instance segmentation masks.
  • Figure 4: Histogram of object distances in meters for class vehicle.
  • Figure 5: Qualitative examples of selected baselines. From left to right: image with stereo depth maps partially overlaid, annotation, DeepLab Papandreou2015, Adelaide Lin2015, and Dilated10 Yu2016. The color coding of the semantic classes matches Figure 1.
  • ...and 7 more figures