Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation

Xingtai Gui; Meijie Zhang; Tianyi Yan; Wencheng Han; Jiahao Gong; Feiyang Tan; Cheng-zhong Xu; Jianbing Shen

Abstract

End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning by unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We then transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring that the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model's foresight, we propose a Future-aware Rewarder, which distills future latent representations from the frozen world model to evaluate and select optimal trajectories in real time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.
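The two-stage selection described in the abstract (a planner proposes top-K candidates from a trajectory vocabulary, then a rewarder picks one of them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the dot-product reward are all assumptions made purely for illustration, since the abstract does not specify how the reward is computed.

```python
# Hypothetical sketch of "propose top-K, then re-rank with a reward".
# All names and the dot-product reward are illustrative assumptions,
# not taken from the paper.

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def select_trajectory(planner_scores, motion_embeddings, future_scene_feature, k):
    """Pick one trajectory index from a vocabulary.

    Stage 1: keep the k candidates the planner scores highest.
    Stage 2: score each candidate against a (distilled) future scene
             feature and return the candidate with the highest reward.
    """
    candidates = top_k_indices(planner_scores, k)
    rewards = {i: dot(motion_embeddings[i], future_scene_feature)
               for i in candidates}
    best = max(candidates, key=lambda i: rewards[i])
    return best, candidates, rewards

# Toy vocabulary of four trajectories with 2-D motion embeddings.
planner_scores = [0.1, 0.9, 0.5, 0.7]
motion_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
future_scene_feature = [1.0, 0.2]

best, candidates, rewards = select_trajectory(
    planner_scores, motion_embeddings, future_scene_feature, k=3
)
print(candidates)  # planner's top-3 candidates, best first
print(best)        # candidate preferred by the reward
```

In this toy example the planner's own favorite (index 1) is not the final choice, because the second stage re-ranks the candidates against the future scene feature. That is the general point of adding a rewarder on top of a multi-modal planner.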

Paper Structure (25 sections, 9 equations, 9 figures, 9 tables)

Introduction
Related Works
End-to-end Autonomous Driving
Driving World Models
World Model for Planning
Method
Trajectory-aware Driving World Model
Multi-modal Trajectory Planner
Driving with Future-aware Rewarder
Training Loss
Experiments
Evaluation on Trajectory Planning
Evaluation on Scene Generation
Implementation Details
WorldDrive on Trajectory Planning
...and 10 more sections

Figures (9)

Figure 1: World models for end-to-end autonomous driving. (a) Planning with future scenes generated by a driving world model. (b) Planning with semantic representation extracted from a latent world model. (c) WorldDrive bridges planning and driving world model via unifying vision and motion representation.
Figure 2: Overall architecture of WorldDrive. WorldDrive is a holistic framework unifying vision and motion representation to bridge scene generation and planning. The training process includes Phase 1: WorldDrive for scene generation and Phase 2: WorldDrive for motion planning. The vision and motion representations are optimized through the scene generation task. In the planning stage, the planner utilizes the frozen vision and trajectory encoders and outputs top-$K$ multi-modal trajectories. A future-aware rewarder is further designed to select the optimal trajectory from the candidates.
Figure 3: Detailed illustration of Future-aware Rewarder. During training, the frozen world model generates future latents. A distillation mechanism aligns the future scene queries with the generated future latents. During the inference phase, the distilled scene features are directly queried by the motion representation to compute future-aware rewards.
Figure 4: Quantitative Analysis of Motion Sensitivity. The similarity between scene representations is inversely correlated with the geometric distance. The sensitivity to both large (a) and small (b) deviations is amplified with further training.
Figure 5: Qualitative planning result of WorldDrive on NAVSIM navtest split. (a) Planning result and the corresponding generated future scene with different trajectories. (b) Top-10 Multi-modal planning trajectories.
...and 4 more figures