GenAD: Generalized Predictive Model for Autonomous Driving

Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li

TL;DR

GenAD proposes a generalized video-prediction framework for autonomous driving. It builds on OpenDV-2K, the largest public driving-video dataset, and on a two-stage training paradigm: the first stage adapts a latent diffusion image model to the driving domain, and the second learns temporal dynamics through temporal reasoning blocks that combine causal temporal attention with decoupled spatial attention. The model shows strong zero-shot generalization across diverse unseen datasets and can be extended to action-conditioned prediction and planning with efficient downstream adaptation. Key contributions include the OpenDV-2K dataset, the two-stage GenAD architecture with temporal reasoning blocks, and improvements over state-of-the-art baselines in both fidelity and temporal coherence. The work offers a scalable, open-world foundation for learning world models and planning capabilities, with practical relevance for simulation and real-world driving systems.

Abstract

In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model, we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos, spanning areas all over the world with diverse weather conditions and traffic scenarios. Inheriting the merits from recent latent diffusion models, our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. We showcase that it can generalize to various unseen driving datasets in a zero-shot manner, surpassing general or driving-specific video prediction counterparts. Furthermore, GenAD can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.

Paper Structure (54 sections, 2 equations, 16 figures, 17 tables)

Figures (16)

  • Figure 1: OpenDV-2K comparison at a glance to existing counterparts in terms of scale and diversity. Note that datasets with ✓ are included in OpenDV-2K (last row). $^\star$Perception subset in Waymo Open, Argoverse 2, and nuPlan. $^\dagger$Estimated by GPT (Ouyang et al., 2022) from video titles.
  • Figure 2: Geographic distribution of OpenDV-2K. Our dataset covers ample driving scenarios around the world.
  • Figure 3: Dataset construction of OpenDV-YouTube with quality check in the loop. We collect videos from YouTubers with qualified driving videos, and discard those with inappropriate viewpoints or scene transitions. Each frame is then described with language contexts using a VLM, followed by keyword checks on the generated texts, such as "words", "watermark", "dark", "blurry", etc. Through this process, distorted or entirely black images are filtered out. A classifier tags videos with high-level intentions as commands, yielding a final corpus of high-quality video-text pairs totaling 1747 hours.
  • Figure 4: Framework of GenAD. (a) The two-stage learning for GenAD is composed of transferring the image domain of an image diffusion model to the driving field (a.1 Stage one), and video prediction pre-training for modeling the temporal dependency of videos (a.2 Stage two). (b) One transformer block in GenAD for the second stage training has interleaved temporal reasoning blocks before each frozen layer to align spatiotemporal features. (c) The proposed Temporal Reasoning Block includes one causal temporal attention (TA) and two decoupled spatial attention (SA) layers to extract features in different axes. A query grid attends to itself as well as blue grids while the dark gray grid is masked out in causal attention. 'Zero init' is appended at the end of each attention block to stabilize training.
  • Figure 5: Task on zero-shot video prediction for unseen scenarios. We show the generation results (in blue boxes) of different models given the same starting frames. GenAD makes more robust, realistic, and reasonable future predictions on unseen datasets (scenarios). More comparisons (Fig. \ref{fig:zero-shot-public}) and visualizations (Fig. \ref{fig:zero-shot-youtube}) are shown in the Appendix.
  • ...and 11 more figures
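
The Temporal Reasoning Block described in the Figure 4 caption (one causal temporal attention followed by two decoupled spatial attentions, each ending in a zero-initialized step) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names `attend` and `temporal_reasoning_block`, the identity query/key/value projections, the ordering of the three attentions, and the scalar `gate` that stands in for the zero-initialized output layer are all assumptions made for illustration.

```python
import math

def attend(seq, causal=False):
    """Single-head self-attention over a list of vectors.
    Simplification (assumption): queries, keys and values are the inputs themselves."""
    d = len(seq[0])
    out = []
    for i, q in enumerate(seq):
        # Causal mode: position i sees only positions 0..i (itself and the past).
        keys = seq[: i + 1] if causal else seq
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * k[c] for wj, k in zip(w, keys)) / z for c in range(d)])
    return out

def temporal_reasoning_block(x, gate=0.0):
    """x[t][h][w] is a feature vector for frame t at grid cell (h, w).
    gate is a hypothetical scalar standing in for a zero-initialized output
    layer: with gate=0.0 the block returns its input unchanged."""
    T, H, W = len(x), len(x[0]), len(x[0][0])

    def residual(base, upd):
        return [[[[a + gate * b for a, b in zip(base[t][h][w], upd[t][h][w])]
                  for w in range(W)] for h in range(H)] for t in range(T)]

    # 1) Causal temporal attention: each grid cell attends along the time axis.
    upd = [[[None] * W for _ in range(H)] for _ in range(T)]
    for h in range(H):
        for w in range(W):
            res = attend([x[t][h][w] for t in range(T)], causal=True)
            for t in range(T):
                upd[t][h][w] = res[t]
    x = residual(x, upd)

    # 2) Spatial attention along the width axis (within each row of a frame).
    upd = [[attend(x[t][h]) for h in range(H)] for t in range(T)]
    x = residual(x, upd)

    # 3) Spatial attention along the height axis (within each column of a frame).
    upd = [[[None] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for w in range(W):
            res = attend([x[t][h][w] for h in range(H)])
            for h in range(H):
                upd[t][h][w] = res[h]
    return residual(x, upd)
```

Under these assumptions, calling `temporal_reasoning_block(x, gate=0.0)` returns the input features unchanged, which mirrors why zero initialization is said to stabilize training: the newly inserted block starts as an identity mapping and cannot disturb the pretrained layers at the first step. With a nonzero gate, the output for frame 0 is unaffected by changes to later frames, since the temporal attention is causal and the spatial attentions operate within a single frame.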