One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation

Zerui Li, Hongpei Zheng, Fangguo Zhao, Aidan Chan, Jian Zhou, Sihao Lin, Shijie Li, Qi Wu

TL;DR

This work tackles Vision-and-Language Navigation in Continuous Environments by introducing GTA, a decoupled framework that separates spatial state estimation from semantic planning. It leverages an Interactive Metric World Representation that fuses a TSDF-based volumetric map with a topological graph, and a Counterfactual Reasoning Brain that enables physically grounded, future-oriented planning via structured prompts. The approach yields state-of-the-art zero-shot performance on R2R-CE and RxR-CE, with strong sim-to-real transfer demonstrated on a TurtleBot 4 and a custom aerial drone. The results suggest that explicit, metric-grounded world models are key to grounding high-level MLLM reasoning into reliable, executable navigation actions in real-world environments.

Abstract

A navigable agent needs to understand both high-level semantic instructions and precise spatial perceptions. Building navigation agents centered on Multimodal Large Language Models (MLLMs) demonstrates a promising solution due to their powerful generalization ability. However, the current tightly coupled design dramatically limits system performance. In this work, we propose a decoupled design that separates low-level spatial state estimation from high-level semantic planning. Unlike previous methods that rely on predefined, oversimplified textual maps, we introduce an interactive metric world representation that maintains rich and consistent information, allowing MLLMs to interact with and reason on it for decision-making. Furthermore, counterfactual reasoning is introduced to further elicit MLLMs' capacity, while the metric world representation ensures the physical validity of the produced actions. We conduct comprehensive experiments in both simulated and real-world environments. Our method establishes a new zero-shot state-of-the-art, achieving 48.8% Success Rate (SR) in R2R-CE and 42.2% in RxR-CE benchmarks. Furthermore, to validate the versatility of our metric representation, we demonstrate zero-shot sim-to-real transfer across diverse embodiments, including a wheeled TurtleBot 4 and a custom-built aerial drone. These real-world deployments verify that our decoupled framework serves as a robust, domain-invariant interface for embodied Vision-and-Language navigation.
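
The abstract states that the metric world representation ensures the physical validity of the actions the MLLM produces. One way such a check could look is sketched below, assuming a 2D occupancy grid derived from the metric map. The function name `is_waypoint_valid`, the grid layout, and the parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def is_waypoint_valid(occupancy, origin_xy, resolution, waypoint_xy):
    """Return True if a proposed waypoint lies inside the map on a free cell.

    Assumptions (illustrative only, not the paper's data structures):
      occupancy  : 2D bool array indexed [row = y, col = x], True = occupied
      origin_xy  : world coordinates (metres) of the corner of cell (0, 0)
      resolution : cell edge length in metres
    """
    col = int(np.floor((waypoint_xy[0] - origin_xy[0]) / resolution))
    row = int(np.floor((waypoint_xy[1] - origin_xy[1]) / resolution))
    rows, cols = occupancy.shape
    if not (0 <= row < rows and 0 <= col < cols):
        return False  # outside the mapped region
    return not bool(occupancy[row, col])

# A 4 x 4 grid with 0.5 m cells and a single obstacle cell.
grid = np.zeros((4, 4), dtype=bool)
grid[1, 2] = True  # covers x in [1.0, 1.5), y in [0.5, 1.0)

free_ok = is_waypoint_valid(grid, (0.0, 0.0), 0.5, (0.2, 0.2))
blocked = is_waypoint_valid(grid, (0.0, 0.0), 0.5, (1.2, 0.7))
outside = is_waypoint_valid(grid, (0.0, 0.0), 0.5, (5.0, 5.0))
```

In a decoupled design of this kind, a waypoint that fails the check could be rejected and the planner queried again, so the language model never has to reason about collisions directly.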

Paper Structure (32 sections, 7 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: Bridging high-level reasoning and embodied execution with GTA. Left: Existing MLLM-based VLN agents reduce the rich 3D environment into an oversimplified linearized text memory according to human knowledge. Right: Our GTA framework decouples spatial modeling from semantic reasoning. Our Interactive Metric World Representation maintains rich spatial and historical information, enabling the MLLM to interact with it accordingly for decision-making. This also enables Counterfactual Reasoning, which further elicits the MLLM's capacity.
  • Figure 2: Qualitative visualization of a navigation episode in R2R-CE. The top row displays the agent's Metric World Representation (top-down metric view), while the bottom row shows the corresponding egocentric panoramic observations at each step. The planned trajectory is marked in yellow, with blue dots indicating waypoints and the red arrow showing the agent's current pose.
  • Figure 3: Overview of the GTA Framework. Our architecture decouples spatial modeling from semantic reasoning. The Metric Mapping Module (left) fuses sparse RGB-D streams via TSDF reconstruction to synthesize a real-time metric map. We construct the Interactive Metric World Representation by unifying this geometric reconstruction with Procedural Reasoning Blueprints, which comprise the logical "TODO List" and topological history. This composite spatial-logic state is rendered via our Interactive Reasoning Interface into a structured prompt for the Counterfactual Reasoning Brain driven by the frozen MLLM (right). Leveraging this unified context, the MLLM directly infers the next metric waypoint $(x, y, z)$.
  • Figure 4: Zero-shot Sim-to-Real transfer across diverse robot embodiments. We demonstrate the generalization capability of GTA by deploying it on two distinct physical platforms in unseen real-world environments. Top Rows (Wheeled Robot): A TurtleBot 4 successfully executes instructions requiring obstacle negotiation and semantic grounding of large objects. Bottom Row (Drone): A custom-built aerial vehicle utilizes the same framework to locate a fine-grained target.
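
Figure 3 mentions fusing RGB-D streams via TSDF reconstruction. As a rough illustration of that idea, the sketch below applies the standard weighted running-average TSDF update to the voxels along a single camera ray. It is a textbook formulation under simplifying assumptions (one ray, constant observation weight), not the paper's implementation, and the function name `tsdf_update` and its parameters are hypothetical.

```python
import numpy as np

def tsdf_update(tsdf, weight, voxel_depth, measured_depth, trunc=0.4, obs_weight=1.0):
    """Fuse one depth measurement into the voxels along a single ray.

    tsdf, weight   : current truncated signed distances and fusion weights
    voxel_depth    : depth of each voxel centre along the ray (metres)
    measured_depth : sensor depth reading for this ray (metres)
    """
    sdf = measured_depth - voxel_depth        # positive in front of the surface
    valid = sdf >= -trunc                     # ignore voxels far behind the surface
    d = np.clip(sdf / trunc, -1.0, 1.0)       # truncate and normalise to [-1, 1]
    new_weight = weight + obs_weight * valid
    averaged = (weight * tsdf + obs_weight * d) / np.maximum(new_weight, 1e-9)
    fused = np.where(valid, averaged, tsdf)   # untouched voxels keep their value
    return fused, new_weight

# Four voxels along one ray, initialised as "unknown free space" with zero weight.
voxel_depth = np.array([0.8, 1.0, 1.2, 2.0])
tsdf = np.ones(4)
weight = np.zeros(4)

tsdf, weight = tsdf_update(tsdf, weight, voxel_depth, measured_depth=1.0)
tsdf2, weight2 = tsdf_update(tsdf, weight, voxel_depth, measured_depth=1.0)
```

The zero crossing of the fused values marks the estimated surface. Averaging over repeated observations is what lets a volumetric map of this kind stay consistent under noisy depth input.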