Rolling Ahead Diffusion for Traffic Scene Simulation
Yunpeng Liu, Matthew Niedoba, William Harvey, Adam Scibior, Berend Zwartsenberg, Frank Wood
TL;DR
The paper addresses reactive, joint traffic scene generation for autonomous driving by combining diffusion models with autoregressive planning. It introduces Rolling Ahead Diffusion (RoAD), a rolling-window diffusion framework that partially denoises future steps while fully denoising the next step, enabling efficient closed-loop planning with strong reactivity. Built on a transformer-based score model and map-conditioned inputs, RoAD outperforms single-step AR baselines and approaches joint-scene diffusion models in accuracy while reducing compute, and it demonstrates robust reactivity against adversarial ego behavior. The approach offers a practical path to realistic, responsive traffic simulation suitable for MPC-like closed-loop evaluation and planning, with conditioning augmentation playing a key role in training stability.
Abstract
Realistic driving simulation requires that NPCs not only mimic natural driving behaviors but also react to the behavior of other simulated agents. Recent developments in diffusion-based scenario generation focus on creating diverse and realistic traffic scenarios by jointly modelling the motion of all the agents in the scene. However, these traffic scenarios do not react when the motion of agents deviates from their modelled trajectories, for example when the ego agent is controlled by a standalone motion planner. To produce reactive scenarios with joint scenario models, the model must regenerate the scenario at each timestep based on new observations, in a Model Predictive Control (MPC) fashion. Although reactive, this method is time-consuming, as one complete possible future for all NPCs is generated per simulation step. Alternatively, one can use an autoregressive (AR) model to predict only the immediate next-step future for all NPCs. Although faster, this method lacks the capability for advance planning. We present a rolling-diffusion-based traffic scene generation model that combines the benefits of both methods by predicting the next-step future while simultaneously predicting partially noised further future steps. We show that such a model is efficient compared to diffusion-model-based AR, achieving a beneficial compromise between reactivity and computational efficiency.
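The rolling-window idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the window layout, the linear noise schedule, and the `toy_denoiser` (a constant-velocity extrapolation standing in for the learned score model) are all assumptions made for illustration. Each simulation step runs one denoising pass over the window, so the nearest slot becomes fully clean and is emitted, while later slots are only partially denoised; the window then shifts and a fresh pure-noise slot is appended.

```python
import numpy as np


def toy_denoiser(window, history, sigma_from, sigma_to):
    """Hypothetical stand-in for a learned score model.

    Estimates clean future positions by constant-velocity extrapolation
    from the last two observed states, then moves each noisy slot toward
    that estimate so its noise level drops from sigma_from to sigma_to.
    """
    last, prev = history[-1], history[-2]
    velocity = last - prev
    num_slots = window.shape[0]
    steps = np.arange(1, num_slots + 1)[:, None, None]
    clean_estimate = last[None] + steps * velocity[None]
    ratio = (sigma_to / sigma_from)[:, None, None]
    return clean_estimate + ratio * (window - clean_estimate)


def rolling_step(window, history, rng):
    """One simulation step of the rolling-window scheme (illustrative).

    window: array of shape (W, num_agents, 2); slot k is assumed to carry
    noise level (k + 1) / W before the step and k / W after it.
    Returns the fully denoised next state and the shifted window.
    """
    num_slots = window.shape[0]
    sigma_from = np.arange(1, num_slots + 1) / num_slots
    sigma_to = np.arange(0, num_slots) / num_slots
    denoised = toy_denoiser(window, history, sigma_from, sigma_to)
    next_state = denoised[0]  # noise level 0: ready to be executed
    fresh = rng.standard_normal(next_state.shape)  # new pure-noise slot
    new_window = np.concatenate([denoised[1:], fresh[None]], axis=0)
    return next_state, new_window


def simulate(num_steps, num_agents=3, num_slots=4, seed=0):
    """Closed-loop rollout; agent 0 plays the externally controlled ego."""
    rng = np.random.default_rng(seed)
    history = [np.zeros((num_agents, 2)), np.ones((num_agents, 2))]
    window = rng.standard_normal((num_slots, num_agents, 2))
    for _ in range(num_steps):
        next_state, window = rolling_step(window, history, rng)
        # A standalone planner could override the ego state here; the
        # next denoising pass then conditions on the deviated history.
        next_state = next_state.copy()
        next_state[0] += np.array([0.5, 0.0])  # arbitrary ego deviation
        history.append(next_state)
    return np.stack(history)
```

Under these assumptions, each simulation step costs a single denoising pass over the window, whereas regenerating the whole scenario per step (the MPC-style approach) or running a full diffusion chain per step (diffusion-based AR) would each require many passes. The actual number of passes per step and the noise schedule used by the model are not specified in this section.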
