
DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning

Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao, Jiangnan Shao, Jiagang Zhu, Tingdong Yu, Zheng Zhu, Guan Huang, Steven L. Waslander

Abstract

Recently, world-action models (WAMs) have emerged to bridge vision-language-action (VLA) models and world models, unifying reasoning and instruction-following capabilities with spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions, respectively. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world-generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides benefits complementary to video imagination and improves planning robustness.

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Comparison of our DriveDreamer-Policy with existing models. Items with dashed lines are optional. Vision-based and VLA planners directly map observations (and optional inputs) to actions without explicitly predicting the future world. World models generate future observations but often rely on external action signals. Recent world–action models unify future world generation and planning, but typically operate on image/video representations. DriveDreamer-Policy extends this line by explicitly generating depth alongside video and actions, enabling geometry-grounded imagination and planning within a unified model.
  • Figure 2: Overview of our DriveDreamer-Policy pipeline. The large language model takes the language instruction, multi-view images and current action, along with a set of learnable queries as inputs to reason and generate world and action embeddings. The generated embeddings are then passed into our three generative expert models as cross-attention conditions to generate depth, future images, and future action.
  • Figure 3: Visualization results of our method. We show the generated depth, video, and actions, respectively. Depth is truncated below 80 meters for better visualization. The generated results remain spatially stable, and the planned trajectories compare well with human trajectories (e.g., aligning with the human trajectory (top) and slowing down more effectively than the human trajectory (bottom)).
  • Figure 4: Visualization of world learning for planning. Columns compare Action-Only, Depth-Action, Video-Action, and Depth-Video-Action variants. Green denotes the human (expert) trajectory and red denotes the predicted trajectory. The three rows correspond to (top) avoiding potential collision by a slower trajectory, (middle) correcting an initially wrong maneuver, and (bottom) aligning more closely with the human trajectory. Depth and video provide complementary world cues that improve safety margins and trajectory consistency.
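The pipeline described in the abstract and in the Figure 2 caption, where a large language model emits world and action embeddings that three lightweight expert generators consume as cross-attention conditions, can be sketched in miniature as follows. This is an illustrative NumPy sketch, not the paper's implementation: the embedding width, token counts, and the single-head attention are all assumptions, and the LLM output is replaced by a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # embedding width (illustrative assumption)

def cross_attention(queries, context):
    """Single-head cross-attention: each query token attends to the
    shared context tokens and returns a conditioned representation."""
    scores = queries @ context.T / np.sqrt(context.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over context
    return weights @ context

# Placeholder for the LLM output: world/action embeddings reasoned from
# the language instruction, multi-view images, current action, and a set
# of learnable queries (here just random vectors).
world_action_emb = rng.standard_normal((16, D))

# Three lightweight generator "experts", each with its own learnable
# queries, conditioned on the shared embeddings via cross-attention.
depth_queries = rng.standard_normal((8, D))   # -> depth tokens
video_queries = rng.standard_normal((8, D))   # -> future-frame tokens
action_queries = rng.standard_normal((4, D))  # -> trajectory tokens

depth_tokens = cross_attention(depth_queries, world_action_emb)
video_tokens = cross_attention(video_queries, world_action_emb)
action_tokens = cross_attention(action_queries, world_action_emb)
```

Because all three experts read the same embeddings, geometry-aware cues learned for depth generation are available to the video and action heads as well, which is the unification the model's design aims for.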