A Survey on Autonomous Driving Datasets: Statistics, Annotation Quality, and a Future Outlook

Mingyu Liu, Ekim Yurtsever, Jonathan Fossaert, Xingcheng Zhou, Walter Zimmer, Yuning Cui, Bare Luka Zagar, Alois C. Knoll

TL;DR

The paper surveys 265 autonomous driving datasets and introduces an impact score that quantifies dataset influence and can guide the creation of new datasets. It analyzes sensor modalities, sensing domains, annotation pipelines, and data distribution, and assesses how adversarial environmental conditions affect performance. The work highlights high-influence datasets across perception, prediction, planning, control, and end-to-end driving, and discusses future directions such as VLM-based data generation, domain adaptation, and open data ecosystems. The result is a structured basis for dataset selection, standardization, and the design of next-generation autonomous driving benchmarks with broader geographic and environmental coverage.

Abstract

Autonomous driving has developed rapidly and shown promising performance thanks to recent advances in hardware and deep learning techniques. High-quality datasets are fundamental for developing reliable autonomous driving algorithms. Previous dataset surveys either covered a limited number of datasets or lacked a detailed investigation of dataset characteristics. To this end, we present an exhaustive study of 265 autonomous driving datasets from multiple perspectives, including sensor modalities, data size, tasks, and contextual conditions. We introduce a novel metric to evaluate the impact of datasets, which can also serve as a guide for creating new datasets. In addition, we analyze the annotation processes, existing labeling tools, and the annotation quality of datasets, showing the importance of establishing a standard annotation pipeline. We also thoroughly analyze the impact of geographical and adversarial environmental conditions on the performance of autonomous driving systems. Moreover, we present the data distribution of several key datasets and discuss their pros and cons accordingly. Finally, we discuss the current challenges and the development trends of future autonomous driving datasets.

Paper Structure (38 sections, 11 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Bird's-Eye View object distribution of datasets. Each heatmap represents a dataset and is plotted using X and Y coordinates. Y is the driving direction of the ego-vehicle. The unique annotation characteristics of each dataset are reflected in the distribution range, density, and number of bounding boxes.
  • Figure 2: Overview of dataset publication trends from 2008 to 2024. The diagram demonstrates a significant increase in the publication of onboard datasets between 2015 and 2020, followed by a gradual decline thereafter. In contrast, there has been a rising trend in the publication of V2X datasets, indicating growing research interest in cooperative perception systems.
  • Figure 3: This survey's primary taxonomy includes impact score, sensors and modalities, autonomous driving tasks, high-influence datasets, and annotation process.
  • Figure 4: Sensors on autonomous driving vehicles. The sensor models shown are (a) Camera: Basler ace acA1600-20uc, (b) LiDAR: Velodyne Puck LITE, (c) Radar: Ainstein Launches K-79, (d) Event-based camera: Evaluation Kit 4 HD, (e) IMU: IMU383_Aceinna-W and (f) Thermal camera: FLIR_2nd_Gen_ADK. All images are taken from the websites hosting the sensors.
  • Figure 5: We present the sensor modalities to provide an intuitive understanding of each sensor's characteristics. (a) is from nuScenes [caesar2020nuscenes], (b) is from KITTI [geiger2012we], (c) is from [weng2023all], (d) is from [gehrig2021dsec], (e) is from [flir]. All images are taken from the open-source data of the datasets.
  • ...and 10 more figures