Drive&Gen: Co-Evaluating End-to-End Driving and Video Generation Models

Jiahao Wang, Zhenpei Yang, Yijing Bai, Yingwei Li, Yuliang Zou, Bo Sun, Abhijit Kundu, Jose Lezama, Luna Yue Huang, Zehao Zhu, Jyh-Jing Hwang, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang

TL;DR

Drive&Gen presents a co-evaluation framework that unites controllable diffusion-based video generation with an E2E driving planner to study realism gaps and domain generalization. By introducing the Behavior Permutation Test (BPT) and leveraging scene-layout conditioning, the authors quantify how synthetic videos influence planner outputs and enable targeted ODD testing. They show that synthetic data can effectively augment real data to improve E2E planner generalization, including under rainy and nighttime conditions, thereby enabling cost-effective expansion of AV services into new operational domains. The framework supports systematic evaluation of both generative realism and planning performance, guiding safer, more robust simulators and planning models for autonomous driving.

Abstract

Recent advances in generative models have sparked exciting new possibilities in the field of autonomous vehicles. Specifically, video generation models are now being explored as controllable virtual testing environments. Simultaneously, end-to-end (E2E) driving models have emerged as a streamlined alternative to conventional modular autonomous driving systems, gaining popularity for their simplicity and scalability. However, the application of these techniques to simulation and planning raises important questions. First, while video generation models can generate increasingly realistic videos, can these videos faithfully adhere to the specified conditions and be realistic enough for E2E autonomous planner evaluation? Second, given that data is crucial for understanding and controlling E2E planners, how can we gain deeper insights into their biases and improve their ability to generalize to out-of-distribution scenarios? In this work, we bridge the gap between the driving models and generative world models (Drive&Gen) to address these questions. We propose novel statistical measures leveraging E2E drivers to evaluate the realism of generated videos. By exploiting the controllability of the video generation model, we conduct targeted experiments to investigate distribution gaps affecting E2E planner performance. Finally, we show that synthetic data produced by the video generation model offers a cost-effective alternative to real-world data collection. This synthetic data effectively improves E2E model generalization beyond existing Operational Design Domains, facilitating the expansion of autonomous vehicle services into new operational contexts.

Paper Structure

This paper contains 12 sections, 2 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: By connecting a driving video generation model with an end-to-end (E2E) planner, we can (1) Evaluate Synthetic Data Quality via Planner by controlling for the same traffic layout and scene conditions as the real videos to assess planner response discrepancies, (2) Assess End-to-end Planner Domain Gap via controlled experiments on operational conditions, and (3) Improve E2E Planner Performance on out-of-distribution domains via synthetic data from the video model. Planner Predictions ($\bm{\rightarrow}$) overlaid. Generated data in italics.
  • Figure 2: Videos generated under various conditions. (1) The top row displays the input conditions, including road maps and bounding boxes, projected to the camera. (2) The second row shows the corresponding real-world video. The subsequent rows demonstrate the model's ability to generate videos under different conditions: (3) conditions identical to those of the original video, (4) the weather changed from no-rain to rain, (5) the time of day changed to 00:00 (midnight), (6) both rain and nighttime conditions applied.
  • Figure 3: Model architecture of our video generation model. We enable control of scene and traffic layout (bounding boxes, road map, and ego car pose) and operational conditions (time-of-day, weather), extending the latent video diffusion model W.A.L.T (Gupta et al., 2023). The conditions are encoded and interact with intermediate features in the diffusion transformer via a combination of AdaLN and cross-attention mechanisms. The model is fine-tuned on a large corpus of driving videos.
  • Figure 4: Evaluation of controllable video generation with FVD, ADE@5s, and BPT on 5000 random samples. FVD doesn't fully capture visual quality: FVD for Rain/Night (relatively rare in our dataset) is much higher because of distribution shifts, even though the photo-realism of the videos is visually similar. FVD also cannot measure controllability: removing the conditioning on bounding boxes greatly changes the car locations but has little effect on FVD. ADE and BPT don't suffer from such data distribution shifts and can capture model controllability: both metrics are notably worse when bounding boxes are removed.
  • Figure 5: Behavioral Permutation Test (BPT) visualizations. BPT performs a set-to-set comparison of predicted trajectories from real and generated videos. In the top row, the two sets of trajectories are similar, so the distance between the two sets (red dashed line) falls well within the permuted distribution and the null hypothesis is not rejected. The bottom row shows a rejection of the null hypothesis, where the two sets of trajectories differ significantly from each other.
  • ...and 2 more figures
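
The set-to-set comparison described in the Figure 5 caption can be illustrated with a generic two-sample permutation test. The sketch below is not the authors' implementation: the statistic (mean waypoint distance between the two sets' average trajectories), the function names, and the toy data are all assumptions made for illustration, since the distance and trajectory representation actually used by BPT are not given in this section.

```python
import math
import random


def mean_trajectory(trajs):
    """Waypoint-wise mean of a set of trajectories (each a list of (x, y))."""
    n = len(trajs)
    horizon = len(trajs[0])
    return [
        (sum(t[k][0] for t in trajs) / n, sum(t[k][1] for t in trajs) / n)
        for k in range(horizon)
    ]


def set_distance(a, b):
    """ASSUMED statistic: average distance between the two mean trajectories."""
    ma, mb = mean_trajectory(a), mean_trajectory(b)
    return sum(math.dist(p, q) for p, q in zip(ma, mb)) / len(ma)


def permutation_test(real, generated, num_permutations=1000, seed=0):
    """Return (observed distance, p-value) under the null hypothesis that
    both trajectory sets come from the same distribution."""
    rng = random.Random(seed)
    observed = set_distance(real, generated)
    pooled = list(real) + list(generated)
    n_real = len(real)
    exceed = 0
    for _ in range(num_permutations):
        # Randomly reassign trajectories to the two groups.
        rng.shuffle(pooled)
        if set_distance(pooled[:n_real], pooled[n_real:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (num_permutations + 1)
    return observed, p_value


def make_toy_set(n, lateral_offset, rng):
    """Hypothetical planner outputs: 5 waypoints driving forward with noise."""
    return [
        [(float(k), lateral_offset + rng.gauss(0.0, 0.2)) for k in range(1, 6)]
        for _ in range(n)
    ]


toy_rng = random.Random(42)
real_set = make_toy_set(20, 0.0, toy_rng)
shifted_set = make_toy_set(20, 5.0, toy_rng)
distance, p = permutation_test(real_set, shifted_set)
print(f"observed distance = {distance:.3f}, p-value = {p:.4f}")
```

A small p-value means the observed distance lies in the tail of the permuted distribution, which corresponds to the rejection case in the bottom row of Figure 5. A large p-value corresponds to the top row, where the null hypothesis is not rejected.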