X-MOBILITY: End-To-End Generalizable Navigation via World Modeling

Wei Liu; Huihua Zhao; Chenran Li; Joydeep Biswas; Billy Okal; Pulkit Goyal; Yan Chang; Soha Pouya

TL;DR

X-Mobility is an end-to-end, world-model-based navigation framework. It learns a probabilistic latent state with an auto-regressive architecture and uses decoupled training to separate world dynamics from policy learning. The model combines a history-aware state estimator, a state predictor aligned by KL divergence, and multi-task decoders (RGB via latent diffusion and semantic via StyleGAN) that encourage a semantically rich latent space for navigation decisions. Training on photorealistic synthetic data proceeds in two stages: off-policy world-model learning, followed by on-policy policy learning. This enables strong generalization, zero-shot Sim2Real transfer, and cross-embodiment capabilities across multiple robot platforms. In closed-loop navigation, X-Mobility outperforms the baselines. Ablations confirm the value of world-model pretraining, semantic decoding, and history integration, and they highlight nuanced effects of diffusion-based policy decoding on stability and path prediction. The work points toward scalable, data-efficient navigation in diverse environments and under real-world constraints.
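To make the KL alignment mentioned above concrete, here is a minimal sketch of the closed-form KL divergence between two diagonal Gaussians. It assumes, as is common in latent world models, that both the state estimator and the state predictor output a mean and a standard deviation per latent dimension, and that the loss is KL(estimator || predictor). The summary does not state the distribution family or the direction of the KL term, so both are assumptions, and the function name `gaussian_kl` is hypothetical.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between two diagonal Gaussians.

    Assumption (not from the paper): q plays the role of the state
    estimator's distribution and p the state predictor's distribution.
    All arguments are equal-length sequences of floats; sigmas must be > 0.
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        # Per-dimension term: log(sp/sq) + (sq^2 + (mq - mp)^2) / (2 sp^2) - 1/2
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


# Identical distributions give zero divergence.
same = gaussian_kl([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])

# Shifting one mean by 1 with unit variance contributes exactly 0.5.
shifted = gaussian_kl([1.0], [1.0], [0.0], [1.0])
```

Minimizing a term of this kind pushes the predictor's distribution toward the estimator's, so that a state predicted without the current observation stays close to the one estimated with it.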

Abstract

General-purpose navigation in challenging environments remains a significant problem in robotics, with current state-of-the-art approaches facing myriad limitations. Classical approaches struggle with cluttered settings and require extensive tuning, while learning-based methods face difficulties generalizing to out-of-distribution environments. This paper introduces X-Mobility, an end-to-end generalizable navigation model that overcomes existing challenges by leveraging three key ideas. First, X-Mobility employs an auto-regressive world modeling architecture with a latent state space to capture world dynamics. Second, a diverse set of multi-head decoders enables the model to learn a rich state representation that correlates strongly with effective navigation skills. Third, by decoupling world modeling from action policy, our architecture can train effectively on a variety of data sources, both with and without expert policies: off-policy data allows the model to learn world dynamics, while on-policy data with supervisory control enables optimal action policy learning. Through extensive experiments, we demonstrate that X-Mobility not only generalizes effectively but also surpasses current state-of-the-art navigation approaches. Additionally, X-Mobility achieves zero-shot Sim2Real transferability and shows strong potential for cross-embodiment generalization.
