Rolling Ahead Diffusion for Traffic Scene Simulation

Yunpeng Liu, Matthew Niedoba, William Harvey, Adam Scibior, Berend Zwartsenberg, Frank Wood

TL;DR

The paper addresses reactive, joint traffic scene generation for autonomous driving by combining diffusion models with autoregressive planning. It introduces Rolling Ahead Diffusion (RoAD), a rolling-window diffusion framework that partially denoises future steps while fully denoising the next step, enabling efficient closed-loop planning with strong reactivity. Built on a transformer-based score model and map-conditioned inputs, RoAD outperforms single-step AR baselines and approaches joint-scene diffusion models in accuracy while reducing compute, and it demonstrates robust reactivity against adversarial ego behavior. The approach offers a practical path to realistic, responsive traffic simulation suitable for MPC-like closed-loop evaluation and planning, with conditioning augmentation playing a key role in training stability.

Abstract

Realistic driving simulation requires that NPCs not only mimic natural driving behaviors but also react to the behavior of other simulated agents. Recent developments in diffusion-based scenario generation focus on creating diverse and realistic traffic scenarios by jointly modelling the motion of all the agents in the scene. However, these traffic scenarios do not react when the motion of agents deviates from their modelled trajectories, for example when the ego-agent is controlled by a standalone motion planner. To produce reactive scenarios with joint scenario models, the model must regenerate the scenario at each timestep based on new observations in a Model Predictive Control (MPC) fashion. Although reactive, this method is time-consuming, as one complete possible future for all NPCs is generated per simulation step. Alternatively, one can utilize an autoregressive (AR) model to predict only the immediate next-step future for all NPCs. Although faster, this method lacks the capability for advance planning. We present a rolling-diffusion-based traffic scene generation model which combines the benefits of both methods by predicting the next-step future while simultaneously predicting partially noised further future steps. We show that such a model is efficient compared to diffusion-model-based AR, achieving a beneficial compromise between reactivity and computational efficiency.
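The trade-off between the three strategies described in the abstract can be made concrete with a rough cost model. The sketch below is illustrative only and is not taken from the paper: it assumes that bringing one element from pure noise to a clean sample costs `n_diffusion_steps` denoiser calls, and that a rolling window of size `window` spreads that budget evenly over the `window` simulation steps an element spends in the window. The names `StepCost` and `per_step_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepCost:
    denoiser_calls: int     # network evaluations per simulation step
    elements_per_call: int  # future timesteps processed by each evaluation


def per_step_cost(strategy: str, n_diffusion_steps: int, horizon: int, window: int) -> StepCost:
    """Rough per-simulation-step cost under the assumptions stated above."""
    if strategy == "mpc":
        # Regenerate the whole future from pure noise at every simulation step.
        return StepCost(n_diffusion_steps, horizon)
    if strategy == "ar":
        # Fully denoise only the next step at every simulation step.
        return StepCost(n_diffusion_steps, 1)
    if strategy == "rolling":
        # Assumption: each element needs 1/window of the full budget per step,
        # because it stays in the window for `window` simulation steps.
        calls = -(-n_diffusion_steps // window)  # ceiling division
        return StepCost(calls, window)
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    for name in ("mpc", "ar", "rolling"):
        print(name, per_step_cost(name, n_diffusion_steps=100, horizon=40, window=20))
```

Under these assumptions the rolling scheme makes fewer denoiser calls per simulation step than single-step AR while still carrying a partially denoised plan for the following `window - 1` steps. Actual costs depend on the sampler and model details given in the paper.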

Paper Structure

This paper contains 27 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: From top to bottom row: DJINN (Niedoba et al., 2024), autoregressive (AR), and RoAD (Ours). The adversarial agent, marked with a red dot, follows its replay log and slows down to reach only half its trajectory by the end of the simulation. Brown circles highlight the interaction region. The agents controlled by the RoAD and AR models slow down to react to the adversarial agent, while agents controlled by the DJINN model do not. Ground truth trajectories are shown in gray, and predicted trajectories are shown in orange.
  • Figure 2: Rolling Diffusion Model. Columns represent sequence timesteps and rows represent diffusion timesteps. Circles are shown in white if the corresponding sequence timestep is fully denoised; black if the sequence timestep is pure noise; and grey if in between. During the denoising process, the SNR for each element in the rolling window depends on the local diffusion time $\tau_w$, which is calculated with either the warm-up-stage equation or the rolling-stage equation, depending on which stage the window is in.
  • Figure 3: From top to bottom row: AR, RoAD-20. By looking beyond the immediate next step, the pedestrian marked with a red dot, controlled by the RoAD-20 planner, avoids colliding with the vehicle. Brown circles highlight the interaction region. Grey trajectories denote replay logs and orange trajectories are the full predicted future. This example demonstrates that RoAD-20, with a longer planning horizon than AR, can anticipate and mitigate interactions with other agents effectively.
  • Figure 4: From top to bottom row: RoAD-20, RoAD-15. The adversarial agent, marked with a red dot, follows its replay log and slows down to reach only half its trajectory by the end of the simulation. Brown circles highlight the interaction region. RoAD-15 achieves better reactivity than RoAD-20, as reducing the window size causes the model to denoise the next element from a lower signal-to-noise ratio (SNR), which provides the model with greater flexibility to adjust to the adversarial agent.
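
The per-slot noise levels described in the Figure 2 caption can be sketched in a few lines. The paper's own equations for $\tau_w$ are not reproduced in this summary, so the sketch below assumes the linear schedule commonly used for rolling diffusion: $\tau_w = \mathrm{clip}((w + t)/W)$ in the rolling stage and $\tau_w = \mathrm{clip}(w/W + t)$ in the warm-up stage, with global time $t \in [0, 1]$, window size $W$, $\tau = 0$ meaning fully denoised and $\tau = 1$ meaning pure noise. The function names are hypothetical.

```python
def clip01(x: float) -> float:
    """Clamp x to the interval [0, 1]."""
    return min(1.0, max(0.0, x))


def local_time_rolling(w: int, t: float, window: int) -> float:
    # Assumed rolling-stage schedule: slots are staggered by 1/window.
    return clip01((w + t) / window)


def local_time_warmup(w: int, t: float, window: int) -> float:
    # Assumed warm-up schedule: starts from pure noise in every slot (t = 1)
    # and ends in the staggered rolling configuration (t = 0).
    return clip01(w / window + t)


def window_schedule(t: float, window: int, stage: str = "rolling") -> list[float]:
    """Local diffusion time for every slot of the window at global time t."""
    fn = local_time_rolling if stage == "rolling" else local_time_warmup
    return [fn(w, t, window) for w in range(window)]


if __name__ == "__main__":
    print(window_schedule(1.0, 4, "warmup"))   # all slots pure noise
    print(window_schedule(0.0, 4, "warmup"))   # staggered noise levels
    print(window_schedule(0.0, 4, "rolling"))  # slot 0 is fully denoised
```

With this schedule, once $t$ reaches 0 the first slot is clean and can be emitted as the next simulation step; dropping it and appending a pure-noise slot gives exactly the configuration at $t = 1$, which is what lets the window roll forward.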