Synthetic Datasets for Autonomous Driving: A Survey

Zhihang Song, Zimin He, Xingyu Li, Qiming Ma, Ruibo Ming, Zhiqi Mao, Huaxin Pei, Lihui Peng, Jianming Hu, Danya Yao, Yi Zhang

TL;DR

This survey addresses the challenge of data scarcity in autonomous driving by cataloging synthetic datasets and organizing them into single-task and multi-task categories. It proposes an evaluation framework that separates static data quality from interactive task-driven effects and introduces a feedback-driven, systematic process for generating trustworthy synthetic data, incorporating safety-oriented testing. The paper highlights domain adaptation, gap-filling techniques, and trustworthiness considerations (e.g., label reliability, V&V) as central contributions, and demonstrates insights through experiments on VKITTI and SHIFT. Overall, synthetic data are shown to complement real-world data, enabling controlled testing, robust domain-shift analysis, and safer, more reliable autonomous driving systems, while outlining concrete directions for improving realism, scope, and evaluation standards. A relative robustness metric $R$ is used to quantify transferability across domains, defined as $R = \frac{\text{Shift Performance}}{\text{Original Performance}}$.
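The relative robustness ratio can be illustrated with a minimal sketch. The function name `relative_robustness` and the numeric values below are hypothetical, chosen only for illustration; they are not results from the paper, and "performance" is assumed here to be a single scalar score such as mAP50.

```python
def relative_robustness(shift_performance: float, original_performance: float) -> float:
    """Return R = shift performance / original performance.

    Both arguments are assumed to be scalar scores on the same metric
    (for example mAP50), measured on the shifted and original domains.
    """
    if original_performance <= 0:
        raise ValueError("original_performance must be positive")
    return shift_performance / original_performance


# Hypothetical scores, for illustration only (not taken from the paper).
example_original = 0.80  # score on the original domain
example_shifted = 0.60   # score on the shifted domain

example_r = relative_robustness(example_shifted, example_original)
print(f"R = {example_r:.2f}")  # prints "R = 0.75"
```

Under this reading, a value of $R$ close to 1 suggests that performance is largely retained under the domain shift, while smaller values suggest a larger drop.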

Abstract

Autonomous driving techniques have been flourishing in recent years while demanding huge amounts of high-quality data. However, real-world datasets struggle to keep up with changing requirements because collecting and labeling them is expensive and time-consuming. Therefore, more and more researchers are turning to synthetic datasets, which make it easy to generate rich and varied data, as an effective complement to real-world data and as a way to improve the performance of algorithms. In this paper, we summarize the evolution of synthetic dataset generation methods and review the work to date on synthetic datasets for autonomous driving research, grouped into single-task and multi-task categories. We also discuss the role that synthetic datasets play in evaluation, gap testing, and improving the testing of autonomous-driving-related algorithms, especially with respect to trustworthiness and safety. Finally, we discuss general trends and possible development directions. To the best of our knowledge, this is the first survey focusing on the application of synthetic datasets in autonomous driving. This survey also raises awareness of the problems of real-world deployment of autonomous driving technology and provides researchers with a possible solution.

Paper Structure

This paper contains 37 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Different development stages of synthetic dataset generation for autonomous driving perception-related tasks.
  • Figure 2: Samples of single-task synthetic datasets. a) FRIDA, b) MPI Sintel, c) Flying Things, d) GTA-V dataset, e) SYNTHIA, f) VEIS, g) Foggy Cityscapes, h) IDDA, i) CarlaScenes.
  • Figure 3: Samples of multi-task synthetic datasets. a) Virtual KITTI, b) VIPER, c) ParallelEye, d) PreSIL, e) Virtual KITTI2, f) SHIFT, g) V2X-Sim, h) AIODrive, i) OPV2V.
  • Figure 4: Static elements and interactive elements for multi-evaluating synthetic datasets.
  • Figure 5: Results of domain shift on KITTI and different situations in BDD100K (mAP50). We trained a YOLOv5s car detection model on the Virtual KITTI training set and tested its performance on KITTI and on different time and weather data splits of the BDD100K validation set.
  • ...and 2 more figures