Decoupled Diffusion Sparks Adaptive Scene Generation

Yunsong Zhou, Naisheng Ye, William Ljungbergh, Tianyu Li, Jiazhi Yang, Zetong Yang, Hongzi Zhu, Christoffer Petersson, Hongyang Li

TL;DR

Nexus tackles the challenge of controllable and reactive driving-scene generation for autonomous systems by decoupling diffusion into goal-oriented and reactive pathways using independent noise states. It introduces noise-masking training to fuse low-noise goal cues with high-noise scene evolution and noise-aware scheduling to update scene tokens in real time. The authors also create Nexus-Data, a large corpus of safety-critical corner cases generated in simulation to improve generalization to rare scenarios. Empirically, Nexus achieves a 40% reduction in displacement error and, with data augmentation, a 20% improvement in closed-loop planning, outperforming prior diffusion-based world-generation approaches.

Abstract

Controllable scene generation could substantially reduce the cost of collecting diverse data for autonomous driving. Prior works formulate traffic layout generation as a predictive process, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full-sequence denoising hinders online reaction, while short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios because open datasets are dominated by safe, ordinary driving behaviors. To overcome these limitations, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinary and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-ins, sudden braking, and collisions. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.
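To make "independent noise states" concrete, here is a minimal sketch of per-token noising, not the authors' implementation. It assumes scalar tokens, a linear blend between the clean value and Gaussian noise, and a noise level drawn uniformly per token. None of these choices are stated in the abstract. The function name `noise_masked_sample` and the `goal_noise` parameter are hypothetical.

```python
import random


def noise_masked_sample(clean_tokens, goal_indices, rng, goal_noise=0.0):
    """Noise each token with its own level; goal tokens stay at low noise.

    Assumptions (illustrative only): tokens are scalars, and a token at
    noise level k becomes (1 - k) * x + k * eps with eps ~ N(0, 1).
    """
    noisy, levels = [], []
    for i, x in enumerate(clean_tokens):
        # Goal tokens get a fixed low level; every other token draws its own
        # level independently in [0, 1).
        k = goal_noise if i in goal_indices else rng.random()
        eps = rng.gauss(0.0, 1.0)
        noisy.append((1.0 - k) * x + k * eps)
        levels.append(k)
    return noisy, levels


rng = random.Random(0)
tokens = [0.5, 1.0, 1.5, 2.0]
# Treat the last token as the goal state, so it is left clean.
noisy, levels = noise_masked_sample(tokens, goal_indices={3}, rng=rng)
```

In a training loop, a denoiser would receive `noisy` together with `levels` and learn to restore `tokens`, so the clean goal token can guide the reconstruction of the noisier ones.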

Paper Structure

This paper contains 26 sections, 6 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Nexus is a noise-decoupled prediction pipeline designed for adaptive driving scene generation, ensuring both timely reaction and goal-directed control. Unlike prior approaches that use (a) full-sequence denoising or (b) next-token prediction, (c) Nexus introduces independent yet structured noise states, enabling more controlled and interactive scene generation. It leverages low-noise goals to steer generation while incorporating environmental updates dynamically, which are captured in subsequent denoising.
  • Figure 2: Preliminaries on scene generation. (a) Current methods encode scenes with tokens for agent and map attributes, formulating scene generation as generating future agent tensors from historical ones, conditioned on a global map tensor. (b) Diffusion models take the entire sequence as input, using hard masks to fix conditions and enable controllable generation via inpainting, yet they fail to react in a timely manner.
  • Figure 3: Framework of Nexus. (a) Nexus learns from realistic and safety-critical driving logs and encodes agents and maps separately before feeding them into a diffusion transformer. The model is trained to restore sequences from partially masked agent tokens guided by low-noise ones. (b) Agent tokens are encoded with time and denoising steps, then interact with the maps and dynamics via attention. (c) Tokens with varying noise are scheduled within a chunk for a timely reaction. Each denoising step updates and pops zero-noise tokens, replacing them with next-frame tokens to iteratively generate the scene.
  • Figure 4: Diagram of the scheduling strategy. An agent's noise varies between zero and one across timesteps, determining the balance between stochasticity and goal-driven guidance at each sampling step. (a) is hindered by excessive steps per frame. (b) reduces cost and follows guidance but cannot react to abrupt changes. (c) distributes cost by progressively adding tokens to the active chunk only at the start of each step, ensuring smoother transitions and better reactivity. (d) enhances future guidance and reduces cost by completing the path from both ends when the goal is fixed. Nexus uses the last two, (c) and (d).
  • Figure 5: Nexus-Data construction. Nexus-Data uses scene records from the nuPlan dataset to reconstruct maps and agents in a simulator, which keeps the scenes realistic. It selects a neighboring vehicle, generates attack trajectories for it via adversarial learning [zhang2023cat], and filters out unrealistic cases.
  • ...and 8 more figures
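
The chunked schedule described in Figures 3(c) and 4(c) can be sketched as a rolling window over frames. This is a minimal illustration under stated assumptions, not the paper's scheduler. It assumes noise is tracked as an integer level (noise = level / chunk_size), the initial chunk forms a staircase, and each denoising step lowers every active token by exactly one level. The denoiser itself is omitted, and the name `rolling_schedule` is hypothetical.

```python
from collections import deque


def rolling_schedule(num_frames, chunk_size):
    """Simulate a staircase noise schedule over a sliding chunk of frames.

    Returns (emitted, trace): the order in which frames reach zero noise,
    and the noise levels of the active chunk after each denoising step.
    """
    # Initial staircase: frame 0 is almost clean, later frames are noisier.
    active = deque(
        (frame, frame + 1) for frame in range(min(chunk_size, num_frames))
    )
    next_frame = len(active)
    emitted, trace = [], []
    while active:
        # One denoising step lowers every active token by one level.
        active = deque((frame, level - 1) for frame, level in active)
        # Pop tokens that reached zero noise and admit the next frame
        # at full noise, keeping the chunk size constant while frames remain.
        while active and active[0][1] == 0:
            frame, _ = active.popleft()
            emitted.append(frame)
            if next_frame < num_frames:
                active.append((next_frame, chunk_size))
                next_frame += 1
        trace.append([level for _, level in active])
    return emitted, trace


emitted, trace = rolling_schedule(num_frames=5, chunk_size=3)
```

Under these assumptions, one frame leaves the chunk per denoising step, so a fresh environment observation could be injected before the next step. That is the reactivity property the captions attribute to chunk-based scheduling.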