MWM: Mobile World Models for Action-Conditioned Consistent Prediction

Han Yan, Zishang Xiang, Zeyu Zhang, Hao Tang

TL;DR

This work proposes MWM, a mobile world model for planning-based image-goal navigation that combines structure pretraining with Action-Conditioned Consistency post-training to improve action-conditioned rollout consistency, and introduces Inference-Consistent State Distillation (ICSD) for few-step diffusion distillation with improved rollout consistency.

Abstract

World models enable planning in an imagined, predicted future space, offering a promising framework for embodied navigation. However, existing navigation world models often lack action-conditioned consistency, so visually plausible predictions can still drift under multi-step rollout and degrade planning. Moreover, efficient deployment requires few-step diffusion inference, but existing distillation methods do not explicitly preserve rollout consistency, creating a training-inference mismatch. To address these challenges, we propose MWM, a mobile world model for planning-based image-goal navigation. Specifically, we introduce a two-stage training framework that combines structure pretraining with Action-Conditioned Consistency (ACC) post-training to improve action-conditioned rollout consistency. We further introduce Inference-Consistent State Distillation (ICSD) for few-step diffusion distillation with improved rollout consistency. Our experiments on benchmark and real-world tasks demonstrate consistent gains in visual fidelity, trajectory accuracy, planning success, and inference efficiency. Code: https://github.com/AIGeeksGroup/MWM. Website: https://aigeeksgroup.github.io/MWM.

Paper Structure (19 sections, 8 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Real-world demonstration of MWM. Upon receiving current observations, MWM imagines action-conditioned future trajectories. Planning is then performed over the candidate rollouts to identify the optimal navigation plan, enabling successful image-goal navigation in real-world environments.
  • Figure 2: Overview of the two-stage training pipeline for MWM. Our training paradigm first performs structure pretraining to learn fine-grained geometry and illumination-dependent appearance, then applies ACC post-training to mitigate compounding error while freezing the CDiT backbone and updating only AdaLN. Within post-training, we introduce ICSD to enable distillation that preserves the consistency objective while aligning truncated training-time estimates with the inference-time endpoint.
  • Figure 3: Qualitative results on SCAND. The predicted frames exhibit action-conditioned consistency (ACC) with the ground-truth frames.
  • Figure 4: Qualitative real-world evaluation. MWM generates planned rollouts that align better with real observations than NWM, indicating reduced multi-step error accumulation and improved planning quality. Blank frames in the figure indicate cases where the robot was forcibly emergency-stopped due to imminent collision.
  • Figure 5: Real-world deployment setup on the AIRBOT Mobile Manipulation Kit 2 (MMK2). (a) Hardware platform. (b) Deployment process.
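
The planning loop described in the Figure 1 caption (imagine action-conditioned futures, then pick the best candidate rollout) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain random-shooting search, and `toy_world_model`, `goal_distance`, `plan`, and all parameter values are hypothetical names and choices introduced here. MWM predicts future image observations with a diffusion model, whereas this toy stand-in tracks only a 2D pose so the example stays self-contained.

```python
import math
import random


def toy_world_model(state, action):
    """Hypothetical stand-in for a learned world model.

    state is (x, y, heading); action is (forward_distance, turn_angle).
    A real world model such as MWM would predict an image, not a pose.
    """
    x, y, heading = state
    forward, turn = action
    heading = heading + turn
    return (x + forward * math.cos(heading),
            y + forward * math.sin(heading),
            heading)


def rollout(state, actions):
    """Apply the model step by step; each prediction feeds the next step."""
    for action in actions:
        state = toy_world_model(state, action)
    return state


def goal_distance(state, goal):
    """Assumed scoring function: Euclidean distance from final pose to goal."""
    return math.hypot(state[0] - goal[0], state[1] - goal[1])


def plan(state, goal, num_candidates=256, horizon=8, seed=0):
    """Sample random action sequences and keep the one whose imagined
    rollout ends closest to the goal (random-shooting search)."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        actions = [(rng.uniform(0.0, 0.5), rng.uniform(-0.5, 0.5))
                   for _ in range(horizon)]
        cost = goal_distance(rollout(state, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


start = (0.0, 0.0, 0.0)
goal = (2.0, 1.0)
best_actions, best_cost = plan(start, goal)
print(f"best final distance to goal: {best_cost:.3f}")
```

Because every step of a rollout consumes the previous prediction, any per-step model error compounds over the horizon and distorts the candidate ranking, which is the failure mode the abstract attributes to models lacking action-conditioned consistency.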