ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation

Yuzhuo Ao, Anbang Wang, Yu-Wing Tai, Chi-Keung Tang

TL;DR

ReasonNavi is a human-inspired framework that operationalizes a reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners. The result is a unified zero-shot navigation framework that requires no MLLM fine-tuning, circumvents the brittleness of RL-based policies, and scales naturally with foundation model improvements.

Abstract

Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations, which restrict global foresight and lead to inefficient exploration. In contrast, humans plan using maps: we reason globally first, then act locally. We introduce ReasonNavi, a human-inspired framework that operationalizes this reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners. ReasonNavi converts a top-down map into a discrete reasoning space by segmenting rooms and sampling candidate target nodes. An MLLM is then queried in a multi-stage process to identify the candidate most consistent with the instruction (object, image, or text goal), effectively leveraging the model's semantic reasoning ability while sidestepping its weakness in continuous coordinate prediction. The selected waypoint is grounded into executable trajectories using a deterministic action planner over an online-built occupancy map, while pretrained object detectors and segmenters ensure robust recognition at the goal. This yields a unified zero-shot navigation framework that requires no MLLM fine-tuning, circumvents the brittleness of RL-based policies, and scales naturally with foundation model improvements. Across three navigation tasks, ReasonNavi consistently outperforms prior methods that demand extensive training or heavy scene modeling, offering a scalable, interpretable, and globally grounded solution to embodied navigation. Project page: https://reasonnavi.github.io/
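
The discrete selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` type, the `select_waypoint` function, and the two scoring callbacks are hypothetical names introduced here, and the callbacks stand in for MLLM queries that the sketch does not perform. The point is only that the model chooses among a finite set of candidates (first a room, then a node inside it) rather than predicting continuous coordinates.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[int, int]  # (row, col) pixel on the top-down map


@dataclass(frozen=True)
class Candidate:
    """A sampled target node and the segmented room it belongs to."""
    room_id: int
    position: Point


def select_waypoint(
    candidates: List[Candidate],
    score_room: Callable[[int, str], float],       # hypothetical stand-in for an MLLM room query
    score_node: Callable[[Candidate, str], float],  # hypothetical stand-in for an MLLM node query
    instruction: str,
) -> Candidate:
    """Two-stage discrete selection: pick the best room, then the best node in it."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    # Stage 1: choose among room labels only.
    rooms = sorted({c.room_id for c in candidates})
    best_room = max(rooms, key=lambda r: score_room(r, instruction))
    # Stage 2: choose among the nodes sampled inside the selected room.
    in_room = [c for c in candidates if c.room_id == best_room]
    return max(in_room, key=lambda c: score_node(c, instruction))


# Toy usage with hand-written scores in place of model outputs.
candidates = [
    Candidate(0, (2, 3)),
    Candidate(0, (5, 5)),
    Candidate(1, (10, 4)),
    Candidate(1, (12, 8)),
]
room_scores = {0: 0.2, 1: 0.9}
chosen = select_waypoint(
    candidates,
    score_room=lambda room, _instr: room_scores[room],
    score_node=lambda c, _instr: -abs(c.position[0] - 12) - abs(c.position[1] - 8),
    instruction="find the sofa",
)
print(chosen)
```

In a real system the two callbacks would wrap prompts over the annotated map image; the selection logic itself stays deterministic and easy to inspect.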

Paper Structure (26 sections, 6 equations, 6 figures, 3 tables, 3 algorithms)

Figures (6)

  • Figure 1: Main difference between our ReasonNavi and previous exploration-based methods: in ReasonNavi, after reasoning and obtaining the location of the desired target, the controlled agent will directly walk towards the object, whereas exploration-based methods heavily rely on extensive local semantic recognition or matching.
  • Figure 2: The ReasonNavi Framework in two stages: 1) Global Reasoning, where a Multimodal Large Language Model (MLLM) reasons about a top-down map and the goal instruction through a multi-stage discrete selection process to determine a precise global target waypoint ($p_{\text{global}}$). This MLLM reasoning stage can be further enhanced by a model ensemble for increased robustness; 2) Local Navigation, where a deterministic planner safely guides the agent to the selected global waypoint using an online occupancy map.
  • Figure 3: For global reasoning, instead of querying the MLLM for a direct coordinate, we devise a hierarchical, two-stage framework that effectively leverages the MLLM's visual priors.
  • Figure 4: Demonstration of ReasonNavi's generalization to diverse map modalities. The agent successfully plans a path from the start point (green) to the goal (blue) on both a clean CAD drawing and a reconstructed map.
  • Figure 5: Ambiguity in the global map may influence the outcome. For example, the MLLM sometimes struggles to pinpoint the clothes (bottom-right corner of the left map) or the TV monitor (bottom-left corner of the right map). Both are marked with blue bounding boxes.
  • ...and 1 more figure
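
The Local Navigation stage described for Figure 2 relies on a deterministic planner over an occupancy map. The source does not say which planner is used, so the sketch below assumes A* on a 4-connected grid purely for illustration; `plan_path` and the grid encoding (0 = free, 1 = occupied) are hypothetical choices, not details from the paper.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col) in the occupancy grid


def plan_path(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """A* on a 4-connected occupancy grid (assumed encoding: 0 = free, 1 = occupied).

    Returns the list of cells from start to goal, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def is_free(cell: Cell) -> bool:
        r, c = cell
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0

    def heuristic(cell: Cell) -> int:
        # Manhattan distance is admissible and consistent for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    if not (is_free(start) and is_free(goal)):
        return None

    open_heap = [(heuristic(start), 0, start)]  # entries are (f, g, cell)
    came_from: Dict[Cell, Optional[Cell]] = {start: None}
    best_cost: Dict[Cell, int] = {start: 0}

    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk the parent links back to the start, then reverse.
            path: List[Cell] = []
            node: Optional[Cell] = current
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        if g > best_cost[current]:
            continue  # stale heap entry; a cheaper route to this cell was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (current[0] + dr, current[1] + dc)
            if is_free(nxt) and g + 1 < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = g + 1
                came_from[nxt] = current
                heapq.heappush(open_heap, (g + 1 + heuristic(nxt), g + 1, nxt))
    return None


# Toy map: a wall across the middle row with a single gap on the right.
grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = plan_path(grid, start=(0, 0), goal=(2, 0))
print(path)
```

Because the occupancy map is built online, an agent would typically re-run such a planner whenever newly observed obstacles invalidate the current path; the function above is stateless, so re-planning is just another call with the updated grid.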