VectorWorld: Efficient Streaming World Model via Diffusion Flow on Vector Graphs

Chaokang Jiang, Desen Zhou, Jiuming Liu, Kevin Li Sun

Abstract

Closed-loop evaluation of autonomous-driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history-free initialization that mismatches policy inputs, (ii) multi-step sampling latency that violates real-time budgets, and (iii) compounding kinematic infeasibility over long horizons. We propose VectorWorld, a streaming world model that incrementally generates ego-centric $64 \mathrm{m}\times 64\mathrm{m}$ lane--agent vector-graph tiles during rollout. VectorWorld aligns initialization with history-conditioned policies by producing a policy-compatible interaction state via a motion-aware gated VAE. It enables real-time outpainting via solver-free one-step masked completion with an edge-gated relational DiT trained with interval-conditioned MeanFlow and JVP-based large-step supervision. To stabilize long-horizon rollouts, we introduce $\Delta$Sim, a physics-aligned non-ego (NPC) policy with hybrid discrete--continuous actions and differentiable kinematic logit shaping. On Waymo Open Motion and nuPlan, VectorWorld improves map-structure fidelity and initialization validity, and supports stable, real-time $1\mathrm{km}+$ closed-loop rollouts (\href{https://github.com/jiangchaokang/VectorWorld}{code}).
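
For readers unfamiliar with MeanFlow, the solver-free one-step completion mentioned above can be related to the average-velocity formulation of the original MeanFlow work. The block below is only a sketch in that generic notation: the symbols $z_t$, $v$, $u_\theta$, and the interval $(r, t)$ are assumptions borrowed from that formulation and do not describe VectorWorld's exact interval conditioning, masking, or loss.

```latex
% Average velocity over an interval [r, t] (generic MeanFlow notation, assumed here)
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Differentiating (t - r) u with respect to t gives the MeanFlow identity;
% the total derivative on the right is the quantity a JVP evaluates
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t),
\qquad
\frac{d}{dt} u(z_t, r, t) = v(z_t, t)\, \partial_z u + \partial_t u

% One-step sampling from noise z_1 to a sample z_0, with no ODE solver
z_0 = z_1 - u_\theta(z_1, 0, 1)
```

Under this reading, a single network evaluation over the full interval $(0, 1)$ replaces the multi-step integration that the abstract identifies as a latency bottleneck.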

Paper Structure (174 sections, 24 equations, 15 figures, 15 tables, 2 algorithms)

Figures (15)

  • Figure 1: Comparison with Prior Works. Unlike prior pipelines that degrade in closed loop due to accumulated drift (autoregressive rollout) or history-free cold-start initialization, VectorWorld supports kilometer-scale closed-loop simulation via streaming outpainting in our experiments. The engine operates in a closed loop: 0) An Ego Planner drives within the scene; 1) Initial Scene Generation creates the base tile; 2) Scene Outpainting dynamically extends the map frontier using an edge-gated relational DiT; and 3) NPC Behavior ($\Delta$Sim) ensures reactive interactions. This mechanism enables consistent, kilometer-scale driving simulation by progressively outpainting the world ahead.
  • Figure 2: Interaction-state interface with motion-aware gated VAE. A motion-aware gate $g_i$ fuses static state $s_i$ and motion code $m_i$, suppressing history noise for stationary agents while preserving dynamics for moving ones. A factorized scene transformer aggregates typed relations (L2L/L2A/A2A) and outputs Gaussian latents $z=(z_{\mathrm{lane}},z_{\mathrm{agent}})$ for downstream generation. Warm-started interaction states align initialization with history-conditioned policies.
  • Figure 3: Edge-gated relational DiT for vector-graph latent generation. The framework generates vector-graph latents via a factorized transformer backbone. To capture complex graph dependencies, we introduce a Relational Attention Module (Right). Unlike standard attention, this module conditions message passing on edge features $\mathbf{e}_{ij}$ via: 1) An additive bias $B_{ij}$ applied to attention scores to regulate connectivity; 2) A multiplicative gate $G_{ij}$ applied to value vectors to modulate feature aggregation. This design enables precise structural control across heterogeneous agents and map elements, compatible with various generative dynamics (DDPM, Flow, or MeanFlow).
  • Figure 4: $\Delta$Sim: physics-aligned NPC policy. A single-pass return-to-go (RTG) control embedding modulates the decoder via FiLM $(\gamma,\beta)$. Actions use a hybrid head (discrete k-disks token plus continuous residual) with differentiable kinematic logit shaping and DKAL regularization. This design reduces feasibility violations that compound under kilometer-scale closed-loop simulation.
  • Figure 5: Qualitative comparison on Waymo and nuPlan. Decoded elements are visualized in $64\,\mathrm{m}\times 64\,\mathrm{m}$ tiles. VectorWorld shows better lane connectivity and fewer agent overlaps at initialization than prior vectorized generators.
  • ...and 10 more figures
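
The relational attention described in the Figure 3 caption (an additive bias $B_{ij}$ on attention scores and a multiplicative gate $G_{ij}$ on value vectors, both conditioned on edge features $\mathbf{e}_{ij}$) can be illustrated with a minimal single-head sketch in plain Python. The function name `edge_gated_attention`, the linear projections `w_bias` and `w_gate`, the scalar form of the bias and gate, and the sigmoid gate are all assumptions made for illustration; the paper's actual parameterization may differ.

```python
import math
from typing import List

Matrix = List[List[float]]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def edge_gated_attention(
    q: Matrix,
    k: Matrix,
    v: Matrix,
    edge: List[Matrix],      # edge[i][j] is the feature vector e_ij
    w_bias: List[float],     # hypothetical projection: e_ij -> scalar bias B_ij
    w_gate: List[float],     # hypothetical projection: e_ij -> gate logit for G_ij
) -> Matrix:
    """Single-head attention with an edge-conditioned bias and value gate."""
    n, d = len(q), len(q[0])
    out: Matrix = []
    for i in range(n):
        # Additive bias B_ij is applied to the scaled dot-product score.
        scores = [
            dot(q[i], k[j]) / math.sqrt(d) + dot(w_bias, edge[i][j])
            for j in range(n)
        ]
        attn = softmax(scores)
        row = [0.0] * len(v[0])
        for j in range(n):
            # Multiplicative gate G_ij in (0, 1) scales the value vector v_j.
            gate = sigmoid(dot(w_gate, edge[i][j]))
            for c in range(len(row)):
                row[c] += attn[j] * gate * v[j][c]
        out.append(row)
    return out
```

In this sketch a strongly negative bias effectively removes an edge from the softmax, while the gate attenuates how much of a neighbor's value is aggregated even when its attention weight is high. The two mechanisms are kept separate because the caption attributes connectivity control to $B_{ij}$ and feature modulation to $G_{ij}$.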