Synthetic Datasets for Autonomous Driving: A Survey
Zhihang Song, Zimin He, Xingyu Li, Qiming Ma, Ruibo Ming, Zhiqi Mao, Huaxin Pei, Lihui Peng, Jianming Hu, Danya Yao, Yi Zhang
TL;DR
This survey addresses the challenge of data scarcity in autonomous driving by cataloging synthetic datasets and organizing them into single-task and multi-task categories. It proposes an evaluation framework that separates static data quality from interactive task-driven effects and introduces a feedback-driven, systematic process for generating trustworthy synthetic data, incorporating safety-oriented testing. The paper highlights domain adaptation, gap-filling techniques, and trustworthiness considerations (e.g., label reliability, V&V) as central contributions, and demonstrates insights through experiments on VKITTI and SHIFT. Overall, synthetic data are shown to complement real-world data, enabling controlled testing, robust domain-shift analysis, and safer, more reliable autonomous driving systems, while outlining concrete directions for improving realism, scope, and evaluation standards. A relative robustness metric $R$ is used to quantify transferability across domains, defined as $R = \frac{\text{Shift Performance}}{\text{Original Performance}}$.
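As a minimal sketch of how the relative robustness ratio $R$ could be computed, the snippet below divides a model's score under domain shift by its score on the original domain. The function name `relative_robustness` and the two example scores are hypothetical and are not taken from the survey; the sketch also assumes a higher-is-better metric such as accuracy or mAP.

```python
def relative_robustness(shift_performance: float, original_performance: float) -> float:
    """Return R = shift performance / original performance.

    Assumes both scores come from the same higher-is-better metric.
    """
    if original_performance <= 0:
        raise ValueError("original_performance must be positive")
    return shift_performance / original_performance


# Hypothetical scores for illustration only (not results from the survey).
original_score = 0.80  # score on the original domain
shifted_score = 0.60   # score on the shifted domain

r = relative_robustness(shifted_score, original_score)
print(f"R = {r:.2f}")
```

Under these assumptions, a value of $R$ close to 1 means the model keeps most of its original performance under the shift, while a smaller value indicates a larger drop.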
Abstract
Autonomous driving techniques have flourished in recent years and demand huge amounts of high-quality data. However, real-world datasets struggle to keep pace with changing requirements because collecting and labeling them is expensive and time-consuming. Therefore, more and more researchers are turning to synthetic datasets, which make it easy to generate rich and variable data, as an effective complement to real-world data and as a way to improve algorithm performance. In this paper, we summarize the evolution of synthetic dataset generation methods and review the work to date on synthetic datasets for autonomous driving research, organized into single-task and multi-task categories. We also discuss the role that synthetic datasets play in evaluation, in gap testing, and in improving the testing of autonomous driving algorithms, especially with respect to trustworthiness and safety. Finally, we discuss general trends and possible development directions. To the best of our knowledge, this is the first survey focusing on the application of synthetic datasets in autonomous driving. This survey also raises awareness of the problems of deploying autonomous driving technology in the real world and offers researchers a possible solution.
