
DriveVA: Video Action Models are Zero-Shot Drivers

Mengmeng Liu, Diankun Zhang, Jiuming Liu, Jianfeng Cui, Hongwei Xie, Guang Chen, Hangjun Ye, Michael Ying Yang, Francesco Nex, Hao Cheng

Abstract

Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video–trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from large-scale pretrained video generation models, capturing continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-horizon rollout consistency. DriveVA achieves an impressive closed-loop performance of 90.9 PDM score on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes, and by 52.5% and 52.4% on Bench2Drive (built on CARLA v2), compared with the state-of-the-art world-model-based planner.
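To make the joint decoding idea concrete, the following is a minimal PyTorch-style sketch of a transformer that denoises video latents and action tokens in one shared sequence, so that predicted trajectories can attend to the imagined scene. All module names, shapes, and hyperparameters are illustrative assumptions for exposition, not the paper's implementation.

```python
import torch
import torch.nn as nn


class JointVideoActionDiT(nn.Module):
    """Illustrative sketch of joint video-action denoising: conditioning
    tokens, noisy video latents, and noisy action tokens are processed as a
    single sequence by one transformer, then split back into per-modality
    predictions. Names and shapes are assumptions, not the paper's code."""

    def __init__(self, d_model=512, n_layers=8, n_heads=8, act_dim=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.video_head = nn.Linear(d_model, d_model)   # noise over video latents
        self.action_head = nn.Linear(d_model, act_dim)  # noise over (x, y) waypoints

    def forward(self, cond_tokens, video_tokens, action_tokens):
        # All inputs are assumed already embedded to d_model. One shared
        # sequence: [conditions | video latents | action tokens].
        x = torch.cat([cond_tokens, video_tokens, action_tokens], dim=1)
        h = self.backbone(x)
        n_c, n_v = cond_tokens.shape[1], video_tokens.shape[1]
        eps_video = self.video_head(h[:, n_c:n_c + n_v])  # video-latent branch
        eps_action = self.action_head(h[:, n_c + n_v:])   # trajectory branch
        return eps_video, eps_action
```

Because both branches share the same attention context, a trajectory prediction cannot diverge from the imagined scene without also degrading the video prediction, which is the intuition behind the tighter video–trajectory alignment claimed above.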


Figures (12)

  • Figure 1: DriveVA: unified video–trajectory rollout for planning. Given history frames, DriveVA rolls out a future video clip (top). The ego trajectory is generated together with the video rollout and remains aligned with the visual scene evolution (middle). Bottom: zero-shot comparisons trained on NAVSIM and evaluated on nuScenes (cross-dataset) and CARLA (cross-domain, real to simulation), showing large relative improvements over PWM [zhao2025forecasting] in displacement error and collision rate.
  • Figure 2: Overall pipeline of DriveVA. Given history observations, the ego state (current velocity vx, vy), and language instructions, the model first encodes conditional signals into latent tokens through a text encoder and a video VAE [wan2025]. A unified diffusion transformer (DiT) [peebles2023DIT] then jointly predicts future video latents and future action tokens in a shared generative process, ensuring strong video–trajectory consistency. To maintain long-horizon temporal coherence, a progressive video continuation strategy recursively rolls out future video clips while updating predicted trajectories (see the rollout sketch after this list).
  • Figure 3: Video–trajectory consistency comparison in zero-shot unseen nuScenes scenarios. In this left-turn scenario, our method produces trajectories that follow the scene evolution in the generated future video. In contrast, PWM [zhao2025forecasting] predicts a straight trajectory while the generated video indicates a left-turn maneuver, revealing a clear video–trajectory mismatch. Here, green denotes the GT trajectory and red the predicted trajectory.
  • Figure 4: DPVO-based qualitative analysis of video–trajectory consistency on nuScenes. We visualize zero-shot nuScenes scenarios with temporal frames, including lane-change, right-turn, and straight-driving cases. GT Future and Pred Future denote the ground-truth and predicted trajectories, while DPVO(gt img) and DPVO(pred img) denote DPVO reconstructions from the ground-truth and predicted future videos. The close alignment among these curves provides qualitative evidence of strong video–trajectory consistency in DriveVA.
  • Figure 5: Visualization of predicted video imaginations and corresponding trajectories. The predicted trajectories follow the scene evolution in the generated future video frames, demonstrating strong video–trajectory consistency enabled by our unified generation framework.
  • ...and 7 more figures
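As a complement to the Figure 2 caption, here is a minimal sketch of the progressive video continuation loop. It assumes a hypothetical `model.sample` interface that jointly returns the next clip's latents and the aligned action tokens; this illustrates the rollout logic only, not the paper's API.

```python
import torch


def progressive_rollout(model, history_latents, cond_tokens, n_clips=4):
    """Sketch of progressive video continuation: each generated clip is
    appended to the history so the next clip continues it, and the ego
    trajectory is updated at every step. `model.sample` is hypothetical."""
    waypoints = []
    for _ in range(n_clips):
        # Jointly sample the next video clip and its aligned action tokens,
        # conditioned on every frame observed or generated so far.
        clip_latents, actions = model.sample(history_latents, cond_tokens)
        waypoints.append(actions)
        # Video continuation: the new clip joins the conditioning history,
        # so the next clip must stay coherent with it.
        history_latents = torch.cat([history_latents, clip_latents], dim=1)
    # Full imagined video plus the trajectory accumulated across clips.
    return history_latents, torch.cat(waypoints, dim=1)
```

Re-conditioning each step on the accumulated history is what keeps long-horizon rollouts temporally coherent, at the cost of sequential (non-parallel) generation across clips.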