SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain

Jiawei Zhou, Linye Lyu, Zhuotao Tian, Cheng Zhuo, Yu Li

TL;DR

SafeMVDrive addresses the scarcity of real-world, multi-view safety-critical data for end-to-end autonomous driving (E2E AD) by integrating a VLM-guided adversarial vehicle selector with a two-stage collision-evasion trajectory generator and a diffusion-based trajectory-to-video synthesizer. The approach yields high-quality, safety-critical, multi-view driving videos grounded in real data, demonstrated on nuScenes with a publicly released 41-scene dataset. Key contributions include GRPO-finetuned VLM-based adversarial vehicle selection, a two-stage, video-compatible trajectory generation pipeline, and a diffusion-based multi-view video generator whose output significantly stresses the planning module of E2E AD systems. The resulting data supports stress-testing and evaluation of autonomous driving planners in realistic, multi-view scenarios, with practical use in safety validation and system development.

Abstract

Safety-critical scenarios are rare yet pivotal for evaluating and enhancing the robustness of autonomous driving systems. While existing methods generate safety-critical driving trajectories, simulations, or single-view videos, they fall short of meeting the demands of advanced end-to-end autonomous driving (E2E AD) systems, which require real-world, multi-view video data. To bridge this gap, we introduce SafeMVDrive, the first framework designed to generate high-quality, safety-critical, multi-view driving videos grounded in real-world domains. SafeMVDrive strategically integrates a safety-critical trajectory generator with an advanced multi-view video generator. To tackle the challenges inherent in this integration, we first enhance the scene understanding ability of the trajectory generator by incorporating visual context, which was previously unavailable to such generators, and by leveraging a GRPO-finetuned vision-language model to achieve more realistic and context-aware trajectory generation. Second, recognizing that existing multi-view video generators struggle to render realistic collision events, we introduce a two-stage, controllable trajectory generation mechanism that produces collision-evasion trajectories, ensuring both video quality and safety-critical fidelity. Finally, we employ a diffusion-based multi-view video generator to synthesize high-quality safety-critical driving videos from the generated trajectories. Experiments conducted on an E2E AD planner demonstrate a significant increase in collision rate when tested with our generated data, validating the effectiveness of SafeMVDrive in stress-testing planning modules. Our code, examples, and datasets are publicly available at: https://zhoujiawei3.github.io/SafeMVDrive/.
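The collision rate mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the metric is computed, so everything below is an assumption made for illustration: the `Scene` container, the function names, and the circle-overlap collision test (each vehicle approximated by a disc of radius `radius` metres in the bird's-eye-view plane) are hypothetical and are not taken from the SafeMVDrive code.

```python
from dataclasses import dataclass
from math import hypot
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in metres, bird's-eye-view frame (assumed)


@dataclass
class Scene:
    """Hypothetical container: one planner rollout plus the other vehicles."""
    ego_plan: Sequence[Point]          # planned ego waypoints, one per timestep
    agents: Sequence[Sequence[Point]]  # each agent's positions at the same timesteps


def collides(scene: Scene, radius: float = 1.0) -> bool:
    """Return True if the ego disc overlaps any agent disc at a shared timestep."""
    for agent in scene.agents:
        for (ex, ey), (ax, ay) in zip(scene.ego_plan, agent):
            # Two discs of equal radius overlap when their centres are
            # closer than twice the radius.
            if hypot(ex - ax, ey - ay) < 2 * radius:
                return True
    return False


def collision_rate(scenes: Sequence[Scene], radius: float = 1.0) -> float:
    """Fraction of scenes in which the planned ego trajectory collides."""
    if not scenes:
        return 0.0
    return sum(collides(s, radius) for s in scenes) / len(scenes)


if __name__ == "__main__":
    ego = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    cut_in = Scene(ego, [[(10.0, 0.0), (5.0, 0.0), (2.5, 0.0)]])   # closes in on ego
    benign = Scene(ego, [[(0.0, 10.0), (1.0, 10.0), (2.0, 10.0)]])  # stays a lane away
    print(collision_rate([cut_in, benign]))
```

Under this reading, a stress test compares `collision_rate` for the same planner on the original scenes and on the generated safety-critical scenes; a higher value on the generated set indicates that the data challenges the planner more.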

Paper Structure

This paper contains 22 sections, 8 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Keyframes from diverse realistic, multi-view, safety-critical videos generated by SafeMVDrive. Red boxes indicate safety-critical vehicles involved in events like cut-ins, rapid rear approaches, and sudden braking. Additional video examples are available via the link provided in the abstract.
  • Figure 2: The SafeMVDrive framework for generating realistic, multi-view safety-critical videos.
  • Figure 3: Comparison between the real-world scene (left) and the BEV-rendered non-visual data (right). Obstacles that physically prevent a collision between Vehicle 1 and the ego vehicle are visible in the real-world view but missing in the non-visual data, potentially misleading heuristic methods.
  • Figure 4: Comparison of videos generated by different methods (front view only). Origin shows an ordinary scene, Naive loses realism near the end, and only ours exhibits both realism and safety-criticality.
  • Figure 5: Adversarial vehicle selection examples using the GRPO-finetuned VLM. The VLM accurately analyzes spatial relationships between vehicles and makes reasonable selections.
  • ...and 2 more figures