Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving

Bin Hu, Zijian Lu, Haicheng Liao, Chengran Yuan, Bin Rao, Yongkang Li, Guofa Li, Zhiyong Cui, Cheng-zhong Xu, Zhenning Li

TL;DR

MAP-World tackles multi-modal planning for autonomous driving by combining Masked Action Planning with a path-integral world model, enabling prior-free generation of diverse, history-consistent trajectories and end-to-end training over the full distribution of futures. By treating future ego motion as masked tokens conditioned on a driving-intent scaffold, it produces multiple trajectory hypotheses without anchors, while a lightweight world model predicts future BEV semantics for each candidate and drives learning through a path-weighted objective. The approach yields state-of-the-art results among world-model-based planners on NAVSIM and strong open-loop performance on nuScenes, with real-time inference latency. The framework reduces reliance on handcrafted priors and RL-based selectors, offering a differentiable alternative for multi-modal planning in complex road scenes.

Abstract

Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
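
The path-weighted objective described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the use of a softmax to turn mode scores into probabilities, and the numeric values are all assumptions made for illustration. The sketch contrasts an expectation over modes (each mode's semantic loss weighted by its trajectory probability) with a single-mode baseline that keeps only the top-scoring mode.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def path_weighted_loss(mode_scores, per_mode_losses):
    """Expected loss over K modes: sum_k p_k * L_k.

    Assumption: trajectory probabilities p_k come from a softmax over
    per-mode scores; the abstract only states that probabilities are
    used as discrete path weights.
    """
    if len(mode_scores) != len(per_mode_losses):
        raise ValueError("need exactly one loss per trajectory mode")
    probs = softmax(mode_scores)
    return sum(p * loss for p, loss in zip(probs, per_mode_losses))


def best_mode_loss(mode_scores, per_mode_losses):
    """Single-mode baseline: keep only the loss of the top-scoring mode."""
    best = max(range(len(mode_scores)), key=lambda k: mode_scores[k])
    return per_mode_losses[best]


# Hypothetical values: three candidate trajectories with equal scores.
scores = [0.0, 0.0, 0.0]
losses = [0.3, 0.6, 0.9]  # stand-ins for per-mode BEV semantic losses

expected = path_weighted_loss(scores, losses)  # every mode contributes
selected = best_mode_loss(scores, losses)      # only one mode contributes
```

With equal scores the weighted objective reduces to the mean of the per-mode losses, so every candidate trajectory contributes to the loss, whereas the baseline ignores all modes except the selected one.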

Paper Structure

This paper contains 19 sections, 15 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Anchor-based selection versus MAP-World. (a) DiffusionDrive generates multi-modal trajectories tied to an anchor set and then selects one as the final plan. (b) MAP-World predicts trajectories directly via masked action planning, without anchors, allowing a broader family of motion modes that better aligns with the ground truth.
  • Figure 2: Overview of MAP-World. (a) Multi-view images and LiDAR are encoded to obtain the current BEV features. The encoded ego state is fused with the BEV features to form the current state representation. (b) Masked Action Planning generates multi-modal trajectories by applying a Transformer decoder to the current state representation. (c) The BEV world model conditions on the multi-modal trajectories and current BEV features to synthesize future BEV features, which are trained via losses against the BEV semantic map and evaluated under a path-integral formulation.
  • Figure 3: Visualization results comparing our method with WoTE. Because our model is not constrained by trajectory anchors and learns from the full set of future features, it trains efficiently and outperforms WoTE across both simple and complex scenes, including challenging edge cases.
  • Figure 4: Visualization of trajectories with noise perturbations of different factors.
  • Figure 5: Qualitative comparison of WoTE and MAP-World on right-turn scenarios from the NAVSIM navtest split.
  • ...and 2 more figures