SimGen: Simulator-conditioned Driving Scene Generation

Yunsong Zhou, Michael Simon, Zhenghao Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, Bolei Zhou

TL;DR

SimGen introduces a cascade diffusion framework conditioned on both real-world data and driving simulators to generate diverse, controllable driving scenes from text prompts and simulator layouts. By transforming simulator conditions into realistic ones via CondDiff before feeding them to a diffusion model, and by unifying multiple modalities with adapters, SimGen mitigates sim2real gaps and condition conflicts. The authors also present the DIVA dataset, blending web and simulated driving data to improve diversity, and demonstrate that synthetic data from SimGen enhances perception tasks like BEV detection and segmentation. The work advances simulation-to-reality data generation and opens avenues for safety-critical data synthesis and closed-loop evaluation in autonomous driving.

Abstract

Controllable synthetic data generation can substantially lower the annotation cost of training data. Prior works use diffusion models to generate driving images conditioned on the 3D object layout. However, those models are trained on small-scale datasets like nuScenes, which lack appearance and layout diversity. Moreover, the trained models often overfit: they can only generate images from layout data drawn from the validation set of the same dataset. In this work, we introduce a simulator-conditioned scene generation framework called SimGen that learns to generate diverse driving scenes by mixing data from the simulator and the real world. It uses a novel cascade diffusion pipeline to address challenging sim-to-real gaps and multi-condition conflicts. To enhance the generative diversity of SimGen, we collect a driving video dataset, DIVA, which contains over 147.5 hours of real-world driving videos from 73 locations worldwide and simulated driving data from the MetaDrive simulator. SimGen achieves superior generation quality and diversity while preserving controllability based on the text prompt and the layout pulled from a simulator. We further demonstrate the improvements brought by SimGen for synthetic data augmentation on BEV detection and segmentation tasks and showcase its capability in safety-critical data generation.

Paper Structure

This paper contains 36 sections, 4 equations, 22 figures, 11 tables.

Figures (22)

  • Figure 1: SimGen is a controllable scene generation paradigm conditioned on a simulator. It learns from real-world and simulated data and then can generate diverse driving scenes based on the simulator's control conditions and text prompt.
  • Figure 1: Comparing DIVA with relevant datasets on scale, diversity, and annotations. $^*$: perception subset. $^+$: including procedural generation (MetaDrive; Li et al., 2022) and safety-critical (CAT; Zhang et al., 2023) data. Cts: countries; Seg: segmentation; Virt: virtual image.
  • Figure 2: Constructing the DIVA dataset. DIVA-Real (left) comprises driving videos collected from YouTube. We apply a Vision Language Model to filter out noisy images via a checklist and use off-the-shelf models to annotate text, depth, and semantic labels. Meanwhile, DIVA-Sim (right) employs scene records and control policies in a simulator to create map elements and objects. It can generate digital twins of real-world data and safety-critical scenes. Various kinds of sensors placed in the simulation then produce multimodal images. Ren.: rendered; T.D.: top-down view. Numbers and letters indicate the sequence of processes.
  • Figure 2: Formats of conditions. Real/SimCond: depth and segmentation; ExtraCond: rendered RGB, instance maps, and top-down views.
  • Figure 3: Illustration of SimGen. SimGen takes text and a scene record as inputs. The text is encoded into features that are used by the subsequent modules, whereas the scene record is rendered by the simulator into simulated depth and segmentation (SimCond) and extra conditions (ExtraCond). SimCond, together with the text features, is fed into the CondDiff module, which converts SimCond into RealCond, representing real depth and segmentation. Finally, the text features, RealCond, and ExtraCond are fed into the ImgDiff module, where an Adapter merges the multi-source conditions into a unified control condition, and ImgDiff generates the driving scene images.
  • ...and 17 more figures
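
The data flow described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration of the wiring between stages, not the authors' implementation: every function and class name here (`encode_text`, `render_in_simulator`, `cond_diff`, `adapter`, `img_diff`, `simgen`) is hypothetical, and plain strings stand in for images, feature tensors, and diffusion models.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cond:
    """Depth and segmentation pair (the format shared by SimCond and RealCond)."""
    depth: str
    segmentation: str


@dataclass
class ExtraCond:
    """Extra simulator outputs: rendered RGB, instance map, top-down view."""
    rendered_rgb: str
    instance_map: str
    top_down: str


def encode_text(prompt: str) -> List[str]:
    # Placeholder text encoder: lowercase whitespace tokens instead of features.
    return prompt.lower().split()


def render_in_simulator(scene_record: str) -> Tuple[Cond, ExtraCond]:
    # Placeholder for simulator rendering of a scene record.
    sim_cond = Cond(f"sim_depth({scene_record})", f"sim_seg({scene_record})")
    extra = ExtraCond(
        f"rgb({scene_record})",
        f"instances({scene_record})",
        f"top_down({scene_record})",
    )
    return sim_cond, extra


def cond_diff(sim_cond: Cond, text_features: List[str]) -> Cond:
    # Placeholder for CondDiff: converts SimCond into RealCond.
    # A real model would also use text_features; this stub only relabels.
    return Cond(
        sim_cond.depth.replace("sim_", "real_", 1),
        sim_cond.segmentation.replace("sim_", "real_", 1),
    )


def adapter(real_cond: Cond, extra: ExtraCond) -> Dict[str, str]:
    # Placeholder Adapter: merges multi-source conditions into one control.
    return {
        "depth": real_cond.depth,
        "segmentation": real_cond.segmentation,
        "rendered_rgb": extra.rendered_rgb,
        "instance_map": extra.instance_map,
        "top_down": extra.top_down,
    }


def img_diff(text_features: List[str], real_cond: Cond, extra: ExtraCond) -> Dict:
    # Placeholder for ImgDiff: returns its conditioning instead of an image.
    return {"text": text_features, "control": adapter(real_cond, extra)}


def simgen(prompt: str, scene_record: str) -> Dict:
    # Cascade: text + scene record -> SimCond/ExtraCond -> RealCond -> image.
    text_features = encode_text(prompt)
    sim_cond, extra = render_in_simulator(scene_record)
    real_cond = cond_diff(sim_cond, text_features)
    return img_diff(text_features, real_cond, extra)


if __name__ == "__main__":
    print(simgen("Rainy night street", "scene_001"))
```

The point of the two-stage layout is that the image generator never sees raw simulator depth and segmentation directly; it only receives the converted RealCond plus the extra conditions merged by the adapter.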