SceneDiffuser: Efficient and Controllable Driving Simulation Initialization and Rollout

Chiyu Max Jiang, Yijing Bai, Andre Cornman, Christopher Davis, Xiukun Huang, Hong Jeon, Sakshum Kulshrestha, John Lambert, Shuangyu Li, Xuanyu Zhou, Carlos Fuertes, Chang Yuan, Mingxing Tan, Yin Zhou, Dragomir Anguelov

TL;DR

SceneDiffuser tackles realistic and controllable driving scene simulation by unifying scene initialization and closed-loop rollout under a single diffusion framework. It introduces amortized diffusion, which spreads the cost of denoising over future simulation steps to cut per-step inference cost (16x fewer inference steps) while mitigating closed-loop errors, and adds generalized hard constraints and LLM-based constrained scene generation for controllability. Scaling up the model improves simulation realism. On the Waymo Open Sim Agents Challenge, it achieves top open-loop performance and the best closed-loop performance among diffusion models.

Abstract

Realistic and interactive scene simulation is a key prerequisite for autonomous vehicle (AV) development. In this work, we present SceneDiffuser, a scene-level diffusion prior designed for traffic simulation. It offers a unified framework that addresses two key stages of simulation: scene initialization, which involves generating initial traffic layouts, and scene rollout, which encompasses the closed-loop simulation of agent behaviors. While diffusion models have proven effective in learning realistic and multimodal agent distributions, several challenges remain, including controllability, maintaining realism in closed-loop simulations, and ensuring inference efficiency. To address these issues, we introduce amortized diffusion for simulation. This novel diffusion denoising paradigm amortizes the computational cost of denoising over future simulation steps, significantly reducing the cost per rollout step (16x fewer inference steps) while also mitigating closed-loop errors. We further enhance controllability through the introduction of generalized hard constraints, a simple yet effective inference-time constraint mechanism, as well as language-based constrained scene generation via few-shot prompting of a large language model (LLM). Our investigations into model scaling reveal that increased computational resources significantly improve overall simulation realism. We demonstrate the effectiveness of our approach on the Waymo Open Sim Agents Challenge, achieving top open-loop performance and the best closed-loop performance among diffusion models.
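The amortization idea in the abstract can be illustrated with a small bookkeeping sketch. It tracks only the noise level of each buffered future step, with no model and no trajectory data. The function names, the linear schedule (the step `i` ahead sits at noise level `i + 1`), the assumption that the buffer length equals the number of noise levels, and the values 80 and 32 are all hypothetical choices made for illustration. They are not taken from the paper, and the sketch does not attempt to reproduce the reported 16x figure.

```python
def amortized_rollout(num_sim_steps, num_levels):
    """Track per-step noise levels only; no model or trajectory data is involved."""
    # Warm-up: assume one full denoising pass over the horizon costs num_levels calls.
    model_calls = num_levels
    # After warm-up, re-noise with a monotonic schedule (assumed linear here):
    # the step i ahead of the current time gets noise level i + 1.
    levels = list(range(1, num_levels + 1))
    executed = []
    for _ in range(num_sim_steps):
        # One model call lowers every buffered step by one noise level.
        levels = [level - 1 for level in levels]
        model_calls += 1
        # The nearest step is now at level 0 (clean) and is executed in the simulator.
        executed.append(levels.pop(0))
        # A new far-future step enters the buffer at the highest noise level.
        levels.append(num_levels)
    return executed, levels, model_calls


def full_ar_calls(num_sim_steps, num_levels):
    """Baseline for comparison: denoise from scratch at every simulation step."""
    return num_sim_steps * num_levels


executed, levels, amortized_calls = amortized_rollout(num_sim_steps=80, num_levels=32)
print(amortized_calls, full_ar_calls(80, 32))
```

Under these assumptions, every executed step has been denoised once per noise level by the time it is consumed, yet each simulation step costs a single model call because the remaining denoising work is shared with the steps behind it in the buffer.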

Paper Structure

This paper contains 28 sections, 7 equations, 18 figures, 1 table, 3 algorithms.

Figures (18)

  • Figure 1: SceneDiffuser: a generative prior for simulation initialization via log perturbation, agent injection, and synthetic scene generation, and for efficient closed-loop simulation at 10Hz via amortized diffusion. It progressively refines initial trajectories throughout the rollout. Environment sim agents are in green-blue gradient (temporal progression), AV agent in orange-yellow, and synthetic agents in red-purple.
  • Figure 2: We formulate various tasks, including behavior prediction, conditional scenegen, and unconditional scenegen, as inpainting tasks on the scene tensor. We represent the scene tensor as a normalized tensor $x\in\mathbb{R}^{A\times \mathcal{T}\times D}$, where $A$, $\mathcal{T}$, and $D$ are the numbers of agents, timesteps, and feature dimensions.
  • Figure 3: SceneDiffuser architecture. Global scene context is encoded into a fixed number of $N_c$ tokens via a Perceiver IO [Jaegle2021_Perceiver] encoder. The noisy scene tokens are fused with local and global context, then used to condition a spatiotemporal transformer-based backbone [Vaswani17nips_AttentionIsAllYouNeed] via Adaptive LayerNorm (AdaLN) [Peebles23iccv_DiT]. Input/output tensors are in green, context tensors in blue, and ops in italics.
  • Figure 4: Amortized diffusion rollout procedure. The warm-up step initializes the predictions for the entire future horizon, which are then perturbed by a monotonic noise schedule $\hat{t}$. The trajectory is iteratively denoised by one step at each simulation step.
  • Figure 5: We compare the influence of replan rate on performance for our Full AR and Amortized AR models. Circle radius $\propto$ number of inference calls over the simulation. At 10Hz, Amortized AR requires 16x fewer model inference calls per step and is more realistic than Full AR.
  • ...and 13 more figures
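
The inpainting formulation in the Figure 2 caption can be sketched as a boolean mask over the scene tensor: observed entries are kept and the rest are left for the model to generate. This is a minimal illustration using NumPy. The helper names, the task strings, and the specific mask layout for each task (history steps known for behavior prediction, a subset of agents known for conditional scenegen, nothing known for unconditional scenegen) are assumptions made for this example, not the paper's implementation.

```python
import numpy as np


def inpainting_mask(task, num_agents, num_steps, num_features,
                    num_history=0, known_agents=()):
    """Return a boolean mask of shape (A, T, D); True marks entries that are given."""
    mask = np.zeros((num_agents, num_steps, num_features), dtype=bool)
    if task == "behavior_prediction":
        # Assumed layout: the first num_history timesteps of every agent are observed.
        mask[:, :num_history, :] = True
    elif task == "conditional_scenegen":
        # Assumed layout: selected agents are fully specified, the rest are generated.
        for agent in known_agents:
            mask[agent] = True
    elif task != "unconditional_scenegen":
        raise ValueError(f"unknown task: {task}")
    # For unconditional scenegen nothing is given, so the mask stays all False.
    return mask


def apply_inpainting(x_known, x_generated, mask):
    """Keep known entries and fill everything else from the generated tensor."""
    return np.where(mask, x_known, x_generated)


rng = np.random.default_rng(0)
A, T, D = 4, 10, 3  # hypothetical sizes for agents, timesteps, features
x_known = rng.normal(size=(A, T, D))
x_generated = rng.normal(size=(A, T, D))  # stands in for a noisy or denoised sample

bp_mask = inpainting_mask("behavior_prediction", A, T, D, num_history=3)
scene = apply_inpainting(x_known, x_generated, bp_mask)
```

In a diffusion sampler, a replacement step like `apply_inpainting` would typically be applied to the intermediate sample so that the given entries stay fixed. Where exactly that happens in SceneDiffuser is not stated in this summary.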