SimGen: Simulator-conditioned Driving Scene Generation
Yunsong Zhou, Michael Simon, Zhenghao Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, Bolei Zhou
TL;DR
SimGen introduces a cascade diffusion framework that learns from both real-world data and a driving simulator to generate diverse, controllable driving scenes from text prompts and simulator layouts. By transforming simulator conditions into realistic ones via CondDiff before feeding them to a diffusion model, and by unifying multiple modalities with adapters, SimGen mitigates sim-to-real gaps and condition conflicts. The authors also present the DIVA dataset, which blends web and simulated driving data to improve diversity, and demonstrate that synthetic data from SimGen improves perception tasks such as BEV detection and segmentation. The work also showcases safety-critical data synthesis and points toward closed-loop evaluation in autonomous driving.
Abstract
Controllable synthetic data generation can substantially lower the annotation cost of training data. Prior works use diffusion models to generate driving images conditioned on the 3D object layout. However, those models are trained on small-scale datasets like nuScenes, which lack appearance and layout diversity. Moreover, the trained models often overfit: they can only generate images from layouts drawn from the validation set of the same dataset. In this work, we introduce a simulator-conditioned scene generation framework called SimGen that learns to generate diverse driving scenes by mixing data from the simulator and the real world. It uses a novel cascade diffusion pipeline to address challenging sim-to-real gaps and multi-condition conflicts. To enhance the generative diversity of SimGen, we collect DIVA, a driving video dataset that contains over 147.5 hours of real-world driving videos from 73 locations worldwide, together with simulated driving data from the MetaDrive simulator. SimGen achieves superior generation quality and diversity while preserving controllability based on the text prompt and the layout pulled from a simulator. We further demonstrate the improvements brought by SimGen for synthetic data augmentation on the BEV detection and segmentation tasks and showcase its capability in safety-critical data generation.
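The two-stage data flow of the cascade pipeline can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the dictionary-based condition format, and the toy stages are hypothetical placeholders, not SimGen's actual interfaces or models. It only shows the ordering: simulator conditions are first mapped to realistic conditions (the role CondDiff plays in the summary above), and the result, together with the text prompt, drives the image generator.

```python
from typing import Callable, Dict, List

# Hypothetical condition format: modality name -> flattened values.
Condition = Dict[str, List[float]]


def cascade_generate(
    sim_condition: Condition,
    prompt: str,
    cond_stage: Callable[[Condition, str], Condition],
    image_stage: Callable[[Condition, str], dict],
) -> dict:
    """Run the two stages in order: sim condition -> realistic condition -> image."""
    realistic_condition = cond_stage(sim_condition, prompt)
    return image_stage(realistic_condition, prompt)


def toy_cond_stage(sim_condition: Condition, prompt: str) -> Condition:
    # Stand-in for the sim-to-real condition stage; a real system would run a
    # diffusion model here. This toy version only relabels each modality.
    return {f"real_{name}": list(values) for name, values in sim_condition.items()}


def toy_image_stage(condition: Condition, prompt: str) -> dict:
    # Stand-in for the image diffusion model; it just records its inputs.
    return {"prompt": prompt, "conditioned_on": sorted(condition)}


result = cascade_generate(
    {"layout": [0.0, 1.0]}, "rainy night", toy_cond_stage, toy_image_stage
)
print(result)
```

Passing the stages as callables keeps the sketch independent of any particular model; the point is only that the image stage never sees raw simulator conditions.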
