DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving

Feiyang Jia, Lin Liu, Ziying Song, Caiyan Jia, Hangjun Ye, Xiaoshuai Hao, Long Chen

TL;DR

DriveWorld-VLA introduces a tightly coupled latent-space framework that unifies Vision-Language-Action with World Models for autonomous driving. It employs feature-level sharing via a Vision-Language Model, action-conditioned what-if reasoning through a Diffusion Transformer, and a three-stage progressive training scheme to stabilize joint optimization. The approach yields state-of-the-art results on NAVSIMv1 (PDMS 91.3) and NAVSIMv2 (EPDMS 86.8) as well as a low collision rate on nuScenes (0.16%), outperforming contemporary E2E and world-model baselines. By enabling controllable, imagination-guided planning directly in latent space, DriveWorld-VLA offers robust, forward-looking decision-making with potential for safer, more proactive autonomous driving.

Abstract

Research on end-to-end (E2E) autonomous driving has shown increasing interest in unifying Vision-Language-Action (VLA) with World Models to enhance decision-making and forward-looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level. This integration lets the VLA planner benefit directly from holistic scene-evolution modeling and reduces reliance on dense annotated supervision. Additionally, DriveWorld-VLA incorporates the latent states of the world model as core decision-making states for the VLA planner, allowing the planner to assess how candidate actions impact future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld-VLA supports controllable, action-conditioned imagination at the feature level, avoiding expensive pixel-level rollouts. Extensive open-loop and closed-loop evaluations demonstrate the effectiveness of DriveWorld-VLA, which achieves state-of-the-art performance with 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and a 3-second average collision rate of 0.16 on nuScenes. Code and models will be released at https://github.com/liulin815/DriveWorld-VLA.git.

Paper Structure (17 sections, 12 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Comparison of VLA & World Model Coupling Strategies. (a) Disentangled Interaction: The world model acts as an external simulator, but its structural isolation from the VLA prevents effective latent knowledge transfer. (b) Feature Sharing: Despite sharing representations, these models lack action-conditioned causal reasoning, which limits their counterfactual imagination and long-horizon planning. (c) Our DriveWorld-VLA: By optimizing world model latents as decision variables, we enable unified causal "what-if" reasoning through controllable imagination in a shared latent space. (d) Performance: DriveWorld-VLA achieves SOTA results (91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and 0.16 CR on nuScenes), significantly outperforming specialized baselines like LAW [li2024enhancing_law], Epona [zhang2025epona], and HERMES-p [zhou2025_hermes].
  • Figure 2: DriveWorld-VLA pipeline. DriveWorld-VLA unifies action and prospective imagination through a progressive training scheme. Stage 1 jointly learns future BEV imagination and action prediction from a shared latent representation. Stage 2 conditions the generative branch on future actions, enabling controllable imagination that maps a given action sequence to its corresponding future. Stage 3 closes the loop: the model first predicts actions, then imagines the resulting future, and finally uses reward feedback to refine its action prediction.
  • Figure 3: The structure of the Action-conditioned Flow-matching Denoiser. The Denoiser conditions the flow-matching process on the BEV state and ground-truth (GT) future actions. The BEV state is processed through LayerNorm, followed by scaling and embedding. These features are then passed through DiT blocks to perform denoising and generate the future BEV states.
  • Figure 4: Visualization examples of 4s trajectory planning on NAVSIM [dauner2024navsim] for DriveWorld-VLA. Stage 2 generates predictions similar to the ground truth, but with a higher collision risk. In contrast, Stage 3 introduces future imagination, resulting in more robust predictions and a significantly reduced collision risk. The changes observed across the training stages confirm the accuracy of the world model and its understanding of physical dynamics. Left label: sample tokens. Top label: source of trajectory.
  • Figure S1: InternVL system prompts. {history_str} denotes the ground-truth historical trajectory sequence, where each frame in the sequence is represented by a 2D coordinate. {command_str} denotes navigation commands in text form, e.g., 'turn left', 'go straight' or 'turn right'. The full prompt fed into the model is composed of each benchmark's specific prompt and the common prompt.
  • ...and 7 more figures
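
The Figure 3 caption describes a denoiser that conditions a flow-matching process on the current BEV state and on future actions. As a rough illustration of what such an objective can look like, the sketch below uses the common linear-interpolant form of flow matching: a noisy latent is blended with the target future latent, and the model regresses the constant velocity between them. This is not the paper's implementation. The names `flow_matching_pair`, `toy_denoiser`, `fm_loss`, and `euler_sample` are hypothetical, the additive "dynamics" (future latent = BEV state + action) is an assumption made only so the toy has a known answer, and the DiT blocks, LayerNorm, and embedding steps are replaced by a closed-form stand-in.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return the linear interpolant x_t and its target velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def toy_denoiser(x_t, t, bev_state, actions, scale=1.0):
    """Hypothetical stand-in for the action-conditioned denoiser.

    Assumes toy dynamics where the future latent equals bev_state + actions,
    so the exact velocity at (x_t, t) is (future - x_t) / (1 - t).
    `scale` mimics model quality: 1.0 is perfect, 0.0 predicts zero velocity.
    """
    future = [b + a for b, a in zip(bev_state, actions)]
    return [scale * (f - x) / (1.0 - t) for f, x in zip(future, x_t)]


def fm_loss(bev_state, actions, future_bev, rng, scale=1.0):
    """One Monte Carlo sample of the flow-matching regression loss (MSE)."""
    x0 = [rng.gauss(0.0, 1.0) for _ in future_bev]  # noise sample
    t = rng.uniform(0.0, 0.9)                       # stay away from t = 1
    x_t, v_target = flow_matching_pair(x0, future_bev, t)
    v_pred = toy_denoiser(x_t, t, bev_state, actions, scale)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


def euler_sample(bev_state, actions, rng, num_steps=10):
    """Generate a future latent by Euler-integrating the predicted velocity."""
    x = [rng.gauss(0.0, 1.0) for _ in bev_state]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps
        v = toy_denoiser(x, t, bev_state, actions)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


bev = [0.5, -1.0, 2.0]       # toy BEV latent
acts = [0.1, 0.2, -0.3]      # toy action embedding
future = [b + a for b, a in zip(bev, acts)]

perfect_loss = fm_loss(bev, acts, future, random.Random(0), scale=1.0)
untrained_loss = fm_loss(bev, acts, future, random.Random(0), scale=0.0)
imagined = euler_sample(bev, acts, random.Random(1))
```

Changing `acts` changes the latent that `euler_sample` converges to, which is the sense in which the imagination is action-conditioned in this toy. In a real system the closed-form `toy_denoiser` would be replaced by a learned network trained to minimise `fm_loss`.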