Causal Composition Diffusion Model for Closed-loop Traffic Generation

Haohong Lin, Xin Huang, Tung Phan-Minh, David S. Hayden, Huan Zhang, Ding Zhao, Siddhartha Srinivasa, Eric M. Wolff, Hongge Chen

TL;DR

This work addresses the challenge of generating traffic scenarios for autonomous vehicle safety that are simultaneously realistic and controllable over long horizons. It introduces CCDiff, a structure-guided diffusion model built on a Constrained Factored MDP and a learned Decision Causal Graph, augmented with Realism Constrained Score Matching and causal composition guidance. Empirical results on nuScenes and in closed-loop simulators show that CCDiff achieves better realism and controllability than state-of-the-art baselines, improving collision rate, off-road rate, final displacement error (FDE), and comfort. The approach provides an interpretable causal structure for traffic reasoning and a scalable framework for safety-critical scenario generation, with potential for integrating larger foundation models and causal benchmarks.

Abstract

Simulation is critical for safety evaluation in autonomous driving, particularly in capturing complex interactive behaviors. However, generating realistic and controllable traffic scenarios in long-tail situations remains a significant challenge. Existing generative models suffer from a conflict between user-defined controllability objectives and realism constraints, which is amplified in safety-critical contexts. In this work, we introduce the Causal Composition Diffusion Model (CCDiff), a structure-guided diffusion framework to address these challenges. We first formulate the learning of controllable and realistic closed-loop simulation as a constrained optimization problem. Then, CCDiff maximizes controllability while adhering to realism by automatically identifying causal structures and injecting them directly into the diffusion process, providing structured guidance to enhance both realism and controllability. Through rigorous evaluations on benchmark datasets and in a closed-loop simulator, CCDiff demonstrates substantial gains over state-of-the-art approaches in generating realistic and user-preferred trajectories. Our results show CCDiff's effectiveness in extracting and leveraging causal structures, with improved closed-loop performance on key metrics such as collision rate, off-road rate, final displacement error (FDE), and comfort.

Paper Structure

This paper contains 67 sections, 16 equations, 29 figures, 9 tables, 3 algorithms.

Figures (29)

  • Figure 1: Comparison of safety-critical scenario generation methods, featuring CCDiff alongside existing methods (STRIVE, CTG, and TrafficSim). The illustrated scenario involves Car 13 executing an unprotected left turn, prompting Car 7 to change lanes and interfere with Car 5. Unlike other methods, CCDiff successfully achieves both realism and controllability in generating this safety-critical scenario. In the right column, CCDiff's spatial reasoning method is compared to a distance-based baseline approach. CCDiff accurately captures the causal relationships between key agents, identifying crucial interactions with greater precision and spatial alignment than distance-based reasoning.
  • Figure 2: (a) Overview of the Causal Composition Diffusion Model. The scene encoder encodes the history and then uses causal reasoning to produce a structured scene encoding and a causal ranking. Finally, we apply guidance only to the top-K agents and exclude the non-causal agents that would not contribute to the guidance objective, which maintains better realism. (b) Summing the score functions over all agents achieves sub-optimal performance because the gradients of the realism and controllability objectives conflict.
  • Figure 3: Detailed model structure of CCDiff, which incorporates a temporal tokenizer, spatial attention, and action decoding. The decision causal graph helps extract spatial patterns to identify the most relevant actions, and the ranking outputs are then used to mask the action outputs. Icons in the figure mark which modules are trainable and which parts are non-trainable during training.
  • Figure 4: Plot of controllability vs. realism in the multi-agent and long-horizon generation settings. CCDiff outperforms baselines in both Generational Distance (GD) and Inverted Generational Distance (IGD), with closer proximity to the Pareto frontier and better coverage of the optimal solutions along the frontier in this multi-objective optimization. Our method is consistently more realistic and controllable than other approaches in both multi-agent and long-horizon scenario generation.
  • Figure 5: (a) Lane-changing at an intersection; (b, c, d) Interpretable computation of DCG from TTC mask and attention map.
  • ...and 24 more figures
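
The Generational Distance (GD) and Inverted Generational Distance (IGD) named in the Figure 4 caption can be illustrated with a minimal sketch. It assumes the common mean-nearest-distance variant of both metrics with Euclidean distance; the exact normalization, objective axes, and reference front used in the paper are not stated here, so the function names and the toy points below are hypothetical.

```python
import numpy as np


def _mean_nearest_distance(sources, targets):
    """Average Euclidean distance from each source point to its nearest target point."""
    sources = np.asarray(sources, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Pairwise differences with shape (num_sources, num_targets, num_objectives).
    diffs = sources[:, None, :] - targets[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return float(dists.min(axis=1).mean())


def generational_distance(solutions, reference_front):
    """GD: how far the obtained solutions lie from the reference Pareto front (lower is better)."""
    return _mean_nearest_distance(solutions, reference_front)


def inverted_generational_distance(solutions, reference_front):
    """IGD: how well the obtained solutions cover the reference Pareto front (lower is better)."""
    return _mean_nearest_distance(reference_front, solutions)


# Toy example (assumed data): two objectives, both to be minimized.
reference_front = [[0.0, 1.0], [1.0, 0.0]]
solutions = [[0.0, 1.0]]

gd = generational_distance(solutions, reference_front)
igd = inverted_generational_distance(solutions, reference_front)
print(gd, igd)
```

In this toy case the single solution sits exactly on the reference front, so GD is zero, while IGD is positive because the other end of the front is left uncovered. This mirrors the distinction the caption draws between proximity to the Pareto frontier and coverage along it.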

Theorems & Definitions (2)

  • Definition 1: Constrained Factored MDP
  • Definition 2: Decision Causal Graph