Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving

Qiqi Liu, Huan Xu, Jingyu Li, Bin Sun, Zhihui Hao, Dangen She, Xiatian Zhu, Li Zhang

Abstract

Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long-horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.
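
To make the interleaved paradigm concrete, the following is a minimal sketch of how such a frame-action rollout could look. It assumes a single autoregressive decoder over a multi-modal token context; the interface names (`generate_frame_tokens`, `generate_action`, `embed_action`) are hypothetical stand-ins, not the paper's actual API.

```python
import torch

def interleaved_rollout(model, history_tokens, horizon=4):
    """Sketch of interleaved world modeling and planning (hypothetical API).

    `history_tokens`: multi-modal history (past frames, depth, ego actions)
    already embedded as a (batch, seq_len, dim) tensor.
    """
    context = history_tokens
    trajectory = []
    for _ in range(horizon):
        # 1) Imagine the next frame conditioned on everything generated so far.
        frame_tokens = model.generate_frame_tokens(context)   # hypothetical call
        context = torch.cat([context, frame_tokens], dim=1)

        # 2) Plan the next ego action conditioned on the freshly imagined frame,
        #    closing the loop between world prediction and control.
        action = model.generate_action(context)                # hypothetical call
        trajectory.append(action)
        context = torch.cat([context, model.embed_action(action)], dim=1)
    return trajectory
```

In contrast to a predict-then-plan pipeline, each action in this loop is conditioned on the most recently imagined frame, and each imagined frame is conditioned on the actions already committed to.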

Paper Structure

This paper contains 10 sections, 10 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Different generative paradigms of unified world models for autonomous driving. (a) Unified world models perform video generation and planning as separate tasks; (b) World-conditioned trajectory prediction, where future trajectories are predicted conditioned on the generated world states; (c) Interleaved world modeling and planning (ours). Visual tokens and action queries are generated alternately, forming a closed-loop interaction that respects the temporal causality of driving.
  • Figure 2: Overview of the alternating generation paradigm in Uni-World VLA. (a) Construction of multi-modal historical information; (b) the interleaved frame-action generative paradigm.
  • Figure 3: Schematic illustration of the training and inference process. (a) Interleaved sequence for joint video generation and trajectory supervision. (b) Causal attention mask (a minimal masking sketch follows this list). (c) Autoregressive interleaved inference with KV-cache reuse.
  • Figure 4: Visualization of predicted frames and BEV trajectories.
  • Figure 5: Comparison of predicted future frames with and without depth fusion.
  • ...and 3 more figures
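
To illustrate the causal attention mask referenced in Figure 3(b), here is a minimal sketch of one plausible block-causal construction for an interleaved frame/action token sequence: tokens attend within their own segment and to all earlier segments. The paper's exact masking may differ; the function below is illustrative, not the authors' implementation.

```python
import torch

def interleaved_causal_mask(segment_lengths):
    """Block-causal mask over an interleaved sequence (assumed layout).

    `segment_lengths` lists the token count of each segment in generation
    order, e.g. [history, frame_1, action_1, frame_2, action_2, ...].
    Returns a boolean matrix where True means attention is allowed.
    """
    total = sum(segment_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in segment_lengths:
        end = start + length
        # Attend to the own segment and every preceding segment.
        mask[start:end, :end] = True
        start = end
    return mask

# Example: 8 history tokens, then two (frame, action) pairs of 16 and 1 tokens.
mask = interleaved_causal_mask([8, 16, 1, 16, 1])
```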