MWM: Mobile World Models for Action-Conditioned Consistent Prediction

Han Yan; Zishang Xiang; Zeyu Zhang; Hao Tang

TL;DR

This work proposes MWM, a mobile world model for planning-based image-goal navigation. MWM combines structure pretraining with Action-Conditioned Consistency (ACC) post-training to improve action-conditioned rollout consistency, and introduces Inference-Consistent State Distillation (ICSD), a few-step diffusion distillation method that improves rollout consistency.

Abstract

World models enable planning over imagined future predictions, offering a promising framework for embodied navigation. However, existing navigation world models often lack action-conditioned consistency, so visually plausible predictions can still drift under multi-step rollout and degrade planning. Moreover, efficient deployment requires few-step diffusion inference, but existing distillation methods do not explicitly preserve rollout consistency, creating a training-inference mismatch. To address these challenges, we propose MWM, a mobile world model for planning-based image-goal navigation. Specifically, we introduce a two-stage training framework that combines structure pretraining with Action-Conditioned Consistency (ACC) post-training to improve action-conditioned rollout consistency. We further introduce Inference-Consistent State Distillation (ICSD) for few-step diffusion distillation with improved rollout consistency. Our experiments on benchmark and real-world tasks demonstrate consistent gains in visual fidelity, trajectory accuracy, planning success, and inference efficiency. Code: https://github.com/AIGeeksGroup/MWM. Website: https://aigeeksgroup.github.io/MWM.
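
The planning loop the abstract refers to (imagine action-conditioned futures, then choose among candidate rollouts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not MWM's implementation: `toy_world_model` is a hypothetical 2D pose model standing in for the learned image-space model, `goal_cost` uses Euclidean distance in place of a comparison against a goal image, and the planner is plain random shooting.

```python
import math
import random

def toy_world_model(state, action):
    """Hypothetical stand-in for a learned world model: predicts the next state.

    State is (x, y, heading); action is (forward_distance, turn_angle).
    MWM itself predicts future image observations, not poses.
    """
    x, y, heading = state
    forward, turn = action
    heading = heading + turn
    return (x + forward * math.cos(heading), y + forward * math.sin(heading), heading)

def rollout(state, actions):
    """Apply the model autoregressively: each prediction feeds the next step."""
    for action in actions:
        state = toy_world_model(state, action)
    return state

def goal_cost(state, goal_xy):
    """Assumed scoring function: Euclidean distance from the final state to the goal."""
    return math.hypot(state[0] - goal_xy[0], state[1] - goal_xy[1])

def plan(state, goal_xy, num_candidates=256, horizon=8, seed=0):
    """Random-shooting planner: sample candidates, roll each out, keep the cheapest."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        actions = [(rng.uniform(0.0, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(horizon)]
        cost = goal_cost(rollout(state, actions), goal_xy)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

start = (0.0, 0.0, 0.0)
goal = (2.0, 1.0)
best_actions, best_cost = plan(start, goal)
print(f"best final distance to goal: {best_cost:.3f}")
```

Because `rollout` feeds each prediction back into the model, any per-step error in the model compounds over the horizon, and the planner ranks candidates by those compounded predictions. This is why rollout consistency matters for planning quality, not only single-step visual fidelity.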

Paper Structure (19 sections, 8 equations, 5 figures, 8 tables)

Introduction
Related Work
The Proposed Method
Overview
Two-Stage Training Pipeline for MWM
Stage I: Structure Pretraining
Stage II: Action-Conditioned Consistency (ACC) Post-training
Inference-Consistent State Distillation (ICSD)
Planning with MWM
Experiments
Experimental Settings
Main Results
Ablation Studies
Real-World Evaluation
Robot Setup
...and 4 more sections

Figures (5)

Figure 1: Real-world demonstration of MWM. Upon receiving current observations, MWM imagines action-conditioned future trajectories. The planning is performed over candidate rollouts to identify the optimal navigation plan, enabling successful image-goal navigation in real-world environments.
Figure 2: Overview of the two-stage training pipeline for MWM. Our training paradigm first performs structure pretraining to learn fine-grained geometry and illumination-dependent appearance, then applies ACC post-training to mitigate compounding error while freezing the CDiT backbone and updating only AdaLN. Within post-training, we introduce ICSD to enable distillation that preserves the consistency objective while aligning truncated training-time estimates with the inference-time endpoint.
Figure 3: Qualitative results on SCAND. The predicted frames exhibit action-conditioned consistency (ACC) with the ground-truth frames.
Figure 4: Qualitative real-world evaluation. MWM generates planned rollouts that align better with real observations than those of NWM, indicating reduced multi-step error accumulation and improved planning quality. Blank frames in the figure indicate cases where the robot was emergency-stopped because of an imminent collision.
Figure 5: Real-world deployment setup on the AIRBOT Mobile Manipulation Kit 2 (MMK2). (a) Hardware platform. (b) Deployment process.
