X-MOBILITY: End-To-End Generalizable Navigation via World Modeling

Wei Liu, Huihua Zhao, Chenran Li, Joydeep Biswas, Billy Okal, Pulkit Goyal, Yan Chang, Soha Pouya

TL;DR

X-Mobility presents an end-to-end world-model-based navigation framework that learns a probabilistic latent state via an auto-regressive architecture and uses decoupled training to separate world dynamics from policy learning. The model incorporates a history-aware state estimator, a state predictor aligned by KL divergence, and multi-task decoders (RGB via latent diffusion and semantic via StyleGAN) to cultivate a semantically rich latent space for navigation decisions. Training is conducted in two stages on photorealistic synthetic data: off-policy world-model learning followed by on-policy policy learning, enabling strong generalization and zero-shot Sim2Real transfer, as well as cross-embodiment capabilities across multiple robot platforms. The results show superior closed-loop navigation performance compared to baselines, with ablations confirming the value of world-model pretraining, semantic decoding, and history integration, while highlighting nuanced effects of diffusion-based policy decoding on stability and path prediction. The work demonstrates practical implications for scalable, data-efficient navigation in diverse environments and under real-world constraints.
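As a rough illustration of what "a state predictor aligned by KL divergence" can mean, the sketch below computes the KL divergence between two diagonal Gaussian distributions, one standing in for the state estimator's output and one for the state predictor's. The diagonal-Gaussian parameterization, the direction of the divergence, and the function name `gaussian_kl` are all assumptions made for this example; the summary above only states that the latent state is probabilistic and that a KL term is used.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Return KL(q || p) for two diagonal Gaussians.

    Each argument is a list with one entry per latent dimension.
    Here q plays the role of the estimator's distribution and p the
    predictor's; this assignment is an assumption, not taken from the paper.
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        # Closed-form KL between two univariate Gaussians, summed over dimensions.
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


# Hypothetical 2-D latent: the two distributions differ only in the first mean.
kl = gaussian_kl([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

Minimizing a term of this kind during training would pull the predictor's distribution toward the estimator's, so that the predictor can stand in for the estimator when future observations are not available.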

Abstract

General-purpose navigation in challenging environments remains a significant problem in robotics, with current state-of-the-art approaches facing myriad limitations. Classical approaches struggle with cluttered settings and require extensive tuning, while learning-based methods face difficulties generalizing to out-of-distribution environments. This paper introduces X-Mobility, an end-to-end generalizable navigation model that overcomes existing challenges by leveraging three key ideas. First, X-Mobility employs an auto-regressive world modeling architecture with a latent state space to capture world dynamics. Second, a diverse set of multi-head decoders enables the model to learn a rich state representation that correlates strongly with effective navigation skills. Third, by decoupling world modeling from action policy, our architecture can train effectively on a variety of data sources, both with and without expert policies: off-policy data allows the model to learn world dynamics, while on-policy data with supervisory control enables optimal action policy learning. Through extensive experiments, we demonstrate that X-Mobility not only generalizes effectively but also surpasses current state-of-the-art navigation approaches. X-Mobility also achieves zero-shot Sim2Real transferability and shows strong potential for cross-embodiment generalization.

Paper Structure

This paper contains 30 sections, 3 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: X-Mobility: an end-to-end world-model-based navigation stack featuring a multi-stage training pipeline using photorealistic synthetic datasets, demonstrating generalizability across out-of-distribution scenarios and zero-shot Sim2Real transferability.
  • Figure 2: Model architecture: i) The state estimator in (c) and state predictor in (d) are designed to capture world dynamics; ii) Along with the multi-task decoders in (e), they generate a rich latent space representation for action policy learning, which takes the latent state and route embedding as inputs.
  • Figure 3: Benchmark warehouse scenarios with varying levels of difficulty.
  • Figure 4: Attention analysis: (a) pretrained DINOv2, (b) fine-tuned with semantic decoding. Attention is directed toward key semantic objects (e.g., sign, pallet, fence) when semantic decoding is enabled, as opposed to being scattered across the entire image without semantic decoding.
  • Figure 5: Qualitative examples of prediction by decoding semantic segmentation from latent state. Green: Navigable, Red: Fence, Blue: Pallet, Orange: Forklift, Purple: Sign.
  • ...and 3 more figures