Risk-Controllable Multi-View Diffusion for Driving Scenario Generation

Hongyi Lin, Wenxiu Shi, Heye Huang, Dingyi Zhuang, Song Zhang, Yang Liu, Xiaobo Qu, Jinhua Zhao

Abstract

Generating safety-critical driving scenarios is crucial for evaluating and improving autonomous driving systems, but long-tail risky situations are rarely observed in real-world data and are difficult to specify through manual scenario design. Existing generative approaches typically treat risk as an after-the-fact label and struggle to maintain geometric consistency in multi-view driving scenes. We present RiskMV-DPO, a general and systematic pipeline for physically informed, risk-controllable multi-view scenario generation. By integrating target risk levels with physically grounded risk modeling, we autonomously synthesize diverse, high-stakes dynamic trajectories that serve as explicit geometric anchors for a diffusion-based video generator. To ensure spatial-temporal coherence and geometric fidelity, we introduce a geometry-appearance alignment module and a region-aware direct preference optimization (RA-DPO) strategy with motion-aware masking to focus learning on localized dynamic regions. Experiments on the nuScenes dataset show that RiskMV-DPO can generate a wide spectrum of long-tail scenarios while maintaining state-of-the-art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70. Our work shifts the role of world models from passive environment prediction to proactive, risk-controllable synthesis, providing a scalable toolchain for the safety-oriented development of embodied intelligence.

Paper Structure (33 sections, 31 equations, 5 figures, 2 tables)

This paper contains 33 sections, 31 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Examples of mined potential risks in typical left-turn scenarios using the proposed per-frame risk quantification. The numbers indicate the risk coefficient of the ego vehicle at that location. (a) Unprotected left turn at an unsignalized intersection with an oncoming straight-moving vehicle. (b) Parallel left turns with tight lateral clearance. (c) Left turn followed by a nearby parking-lot exit that creates a local blind spot.
  • Figure 2: Overview of the proposed RiskMV-DPO framework. Given multi-view observations and scenario context, trajectories and 3D bounding boxes generated by the risk control module at a specified risk level are used as structured motion conditions. A multimodal encoder embeds view tokens and motion cues, which are then injected into a diffusion backbone composed of spatial and temporal STDiT3 blocks. Region-aware DPO further aligns the generation toward localized dynamic regions, producing temporally coherent multi-view driving videos consistent with the specified risk-conditioned motion.
  • Figure 3: 0.95 quantile (higher risk)
  • Figure 4: 0.8 quantile (medium risk)
  • Figure 5: 0.2 quantile (lower risk)