Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework

Xinhao Xiang, Abhijeet Rastogi, Jiawei Zhang

TL;DR

This paper tackles the risk that AI-generated driving videos (AIGVs) may harm autonomous driving (AD) models when used for training or evaluation. It introduces ADGV-Bench, a driving-focused benchmark with dense perception annotations, and ADGVE, a driving-aware evaluator that fuses static, temporal, lane, and Vision-Language checks to rate clip quality. The authors show that naive use of raw AIGVs degrades AD perception, while filtering with ADGVE improves downstream detection, tracking, and segmentation and enables AIGVs to complement real data. The work provides a practical, model-agnostic quality gate for safely integrating large-scale generated driving videos into AD pipelines.

Abstract

Recent text-to-video models have enabled the generation of high-resolution driving scenes from natural language prompts. These AI-generated driving videos (AIGVs) offer a low-cost, scalable alternative to real or simulator data for autonomous driving (AD). But a key question remains: can such videos reliably support training and evaluation of AD models? We present a diagnostic framework that systematically studies this question. First, we introduce a taxonomy of frequent AIGV failure modes, including visual artifacts, physically implausible motion, and violations of traffic semantics, and demonstrate their negative impact on object detection, tracking, and instance segmentation. To support this analysis, we build ADGV-Bench, a driving-focused benchmark with human quality annotations and dense labels for multiple perception tasks. We then propose ADGVE, a driving-aware evaluator that combines static semantics, temporal cues, lane obedience signals, and Vision-Language Model (VLM)-guided reasoning into a single quality score for each clip. Experiments show that blindly adding raw AIGVs can degrade perception performance, while filtering them with ADGVE consistently improves both general video quality assessment metrics and downstream AD models, and turns AIGVs into a beneficial complement to real-world data. Our study highlights both the risks and the promise of AIGVs, and provides practical tools for safely leveraging large-scale video generation in future AD pipelines.

Paper Structure

This paper contains 49 sections, 3 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Overview of our pipeline. Prompt-only video generators (e.g., Veo 3, Sora, Pika) produce raw AI-generated driving videos (AIGVs). ADGVE diagnoses failure cases and filters out low-quality clips, yielding a high-quality subset that can be combined with existing real datasets as additional training/testing data for downstream driving perception tasks such as object detection, multi-object tracking, and instance segmentation.
  • Figure 2: Challenges of current AI-generated driving videos, with example failure cases. More failure cases are provided in Supp. \ref{sec:challenges_more}.
  • Figure 3: Overview of the ADGVE evaluator. Given an AI-generated driving video $v$, ADGVE extracts static object and infrastructure priors (Physical Validity, Semantic Plausibility), temporal tracks and motion features (Object Consistency), and converts them into a visual bundle $\mathcal{V} = \{\mathbf{X}, \mathbf{C}, \mathbf{Y}, \mathbf{Y}'\}$ plus textual summaries $\mathcal{S}$ for a video-language model. A lane-obedience module computes an additional geometric score from lanes and trajectories. All VLM-derived scores and descriptors are fused into a single driving-aware quality score $S_{\text{overall}}$, which we use to filter low-quality AIGVs (default threshold $S_{\text{overall}} > 0.2$) and to analyze their impact on downstream driving perception.
  • Figure 4: ADGV-Bench collection process. We first use an LLM to create prompts, then use these prompts to obtain AI-generated driving videos from various video generation models. We then manually evaluate the quality of each video to obtain quality scores, and annotate each frame with bounding boxes, tracks, and masks for selected traffic-related categories.
  • Figure 5: Additional failure cases illustrating the challenges of current AI-generated driving videos. We briefly describe the violation for each case. 1.1 Temporal Instability: (a) cyclist appearance flicker across frames; (b) cars changing shape between frames; (c) animal identity drift; (d) car front lights shifting positions. 1.2 Physical Inaccuracy: (a) melting traffic cones; (b) deformed delivery truck; (c) trees with impossible geometry; (d) unrealistic fog transitions; (e) bus with distorted body. 1.3 Unrealistic Artifacts: (a) floating cone base; (b) duplicated tram cabin. 2.1 Agent Behavior Violation: (a) car drifting out of lane without turn; (b) cyclist pedaling sideways; (c) truck making impossible sideways motion; (d) ego-followed car teleporting; (e) truck colliding with static object; (f) cyclist riding in the middle of the lane; (g) ambulance driving against lane markings. 2.2 Infrastructure Consistency: (a) inconsistent lane arrows; (b) traffic light with impossible color pattern; (c) misaligned traffic sign; (d) lane lines with broken connectivity; (e) zebra crossing with distorted perspective. 2.3 Ego Vehicle Impossibility: (a) ego lane bending inconsistently over time; (b) ego car driving on roundabout island; (c) ego car parked in impossible location; (d) ego camera drifting off road edge. The relative number of examples shown under each challenge roughly reflects how frequently that failure type appears in current AI-generated driving videos. Please enlarge the figure for the best viewing experience.
  • ...and 1 more figure
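
The fuse-then-filter step described in the Figure 3 caption can be illustrated with a minimal Python sketch. Only the default threshold ($S_{\text{overall}} > 0.2$) comes from the caption. The `ClipScores` fields, the assumption that each component score lies in $[0, 1]$, and the unweighted mean used as the fusion rule are all hypothetical stand-ins, since this summary does not state how ADGVE actually combines its scores.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class ClipScores:
    """Hypothetical per-clip component scores, each assumed to lie in [0, 1]."""
    clip_id: str
    static: float    # static semantics check
    temporal: float  # temporal / object-consistency check
    lane: float      # lane-obedience check
    vlm: float       # VLM-guided reasoning check


def overall_score(scores: ClipScores) -> float:
    # Assumption: an unweighted mean stands in for the fusion rule,
    # which is not specified in the text above.
    return fmean([scores.static, scores.temporal, scores.lane, scores.vlm])


def filter_clips(clips: list[ClipScores], threshold: float = 0.2) -> list[ClipScores]:
    # Keep clips whose fused score strictly exceeds the threshold,
    # mirroring the default S_overall > 0.2 from the Figure 3 caption.
    return [clip for clip in clips if overall_score(clip) > threshold]


if __name__ == "__main__":
    clips = [
        ClipScores("clip_a", 0.9, 0.8, 0.7, 0.6),
        ClipScores("clip_b", 0.1, 0.1, 0.2, 0.0),
    ]
    for clip in filter_clips(clips):
        print(clip.clip_id, round(overall_score(clip), 3))
```

Keeping the threshold as a parameter makes it easy to study how strict filtering trades off the amount of retained data against its quality.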