TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

Linqing Zhong, Chen Gao, Zihan Ding, Yue Liao, Huimin Ma, Shifeng Zhang, Xu Zhou, Si Liu

TL;DR

This paper addresses Zero-Shot Object Navigation (ZSON) by having a Multimodal LLM (MLLM) reason directly on a top-view map, which preserves spatial information that is lost when visual input is translated into natural language. It introduces three core components: Adaptive Visual Prompt Generation (AVPG), which builds a semantically rich top-view map; Dynamic Map Scaling (DMS), which zooms the map for local reasoning; and a Potential Target Driven (PTD) mechanism, which predicts target locations and guides exploration through a Gaussian-based fusion strategy. A local policy converts this high-level guidance into low-level actions. Experiments on MP3D and HM3D report state-of-the-art SR and SPL relative to prior methods, including training-free baselines. The approach demonstrates the value of top-view spatial information for robust, human-like exploration and object discovery in unseen environments.

Abstract

The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such goal-oriented exploration relies heavily on the ability to perceive, understand, and reason about the spatial information of the environment. However, current LLM-based approaches convert visual observations into language descriptions and reason in the linguistic space, which loses spatial information. In this paper, we introduce TopV-Nav, an MLLM-based method that reasons directly on the top-view map, which retains sufficient spatial information. To fully unlock the MLLM's spatial reasoning potential from the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct a semantically rich top-view map. It enables the agent to directly use the spatial information contained in the top-view map to conduct thorough reasoning. In addition, we design a Dynamic Map Scaling (DMS) mechanism that dynamically zooms the top-view map to preferred scales, enhancing local fine-grained reasoning. We also devise a Potential Target Driven (PTD) mechanism to predict and utilize target locations, facilitating global and human-like exploration. Experiments on the MP3D and HM3D datasets demonstrate the superiority of TopV-Nav.
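The zooming idea behind DMS can be illustrated with a small sketch. This is not the authors' implementation: the function name `zoom_map`, the grid-of-lists map representation, and the nearest-neighbour upsampling are all assumptions made for illustration. The only thing taken from the text is that the map is zoomed around a chosen location at a chosen scale.

```python
def zoom_map(grid, center, scale):
    """Crop a window around `center` and upsample it to the original size.

    Hypothetical helper: `grid` is a 2D list of map cells, `center` is a
    (row, col) cell, and `scale` >= 1 is the zoom factor (2 means 2x zoom).
    """
    if scale < 1:
        raise ValueError("scale must be >= 1")
    height, width = len(grid), len(grid[0])
    # The cropped window shrinks as the zoom factor grows.
    win_h = max(1, round(height / scale))
    win_w = max(1, round(width / scale))
    # Clamp the window so that it stays inside the map bounds.
    top = min(max(center[0] - win_h // 2, 0), height - win_h)
    left = min(max(center[1] - win_w // 2, 0), width - win_w)
    # Nearest-neighbour upsampling back to (height, width).
    return [
        [grid[top + (r * win_h) // height][left + (c * win_w) // width]
         for c in range(width)]
        for r in range(height)
    ]


# Toy 8x8 map whose cell value encodes its own position.
toy_map = [[r * 8 + c for c in range(8)] for r in range(8)]
zoomed = zoom_map(toy_map, center=(4, 4), scale=2)
```

In this toy setup the zoomed map keeps the original resolution but covers only the 4x4 window around the chosen center, so each original cell occupies a 2x2 block.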

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Current LLM-based methods fall into two exploration paradigms: frontier-based and waypoint-based. Both convert the map to text so that the LLM can reason in the linguistic domain, which loses the spatial information embedded in the map, e.g., the room layout and the spatial relations among objects. (b) TopV-Nav takes the top-view map as input and leverages an MLLM to reason directly on the map image, fully utilizing the spatial information in the map.
  • Figure 2: Overall framework of TopV-Nav. During navigation, the agent receives egocentric RGB-D images $I_t$ from the environment, and AVPG constructs a corresponding top-view map $M_t$. Visual prompts are adaptively drawn onto the map, where the elements are arranged to reflect their spatial relationships. In DMS, the MLLM interprets $M_t$ and optionally selects a region of interest; the map is then scaled according to the predicted center coordinates and a dynamic scaling factor to reveal more detailed spatial information. In PTD, the MLLM interprets the scaled map $M_{t,sub}$ to estimate the potential location of the target object and to assign probability scores to key areas. A Gaussian-based fusion strategy then produces a value map, from which the next moving location is chosen. Finally, the local policy generates a series of low-level actions towards the moving location.
  • Figure 3: Illustration of the DMS mechanism.
  • Figure 4: Illustration of the PTD mechanism.
  • Figure 5: Qualitative comparisons of navigation decisions between TopV-Nav and an LLM-based baseline. Best viewed in color.
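
The Gaussian-based fusion mentioned in the Figure 2 caption can be sketched as follows. The exact fusion formula is not given in this summary, so the sketch assumes the simplest variant: each scored location contributes one isotropic Gaussian weighted by its score, the contributions are summed into a value map, and the highest-valued cell is taken as the moving location. The names `gaussian_value_map`, `argmax_cell`, and the `sigma` parameter are hypothetical.

```python
import math


def gaussian_value_map(height, width, scored_points, sigma=2.0):
    """Sum one score-weighted isotropic Gaussian per (row, col, score) point.

    Assumed fusion rule for illustration only, not the paper's formula.
    """
    value_map = [[0.0] * width for _ in range(height)]
    for row0, col0, score in scored_points:
        for r in range(height):
            for c in range(width):
                dist_sq = (r - row0) ** 2 + (c - col0) ** 2
                value_map[r][c] += score * math.exp(-dist_sq / (2.0 * sigma ** 2))
    return value_map


def argmax_cell(value_map):
    """Return the (row, col) of the highest-valued cell."""
    cells = (
        (value, r, c)
        for r, row in enumerate(value_map)
        for c, value in enumerate(row)
    )
    _, best_r, best_c = max(cells, key=lambda cell: cell[0])
    return best_r, best_c


# Two hypothetical key areas with scores 0.3 and 0.9 on a 10x12 grid.
points = [(2, 2, 0.3), (7, 8, 0.9)]
vmap = gaussian_value_map(10, 12, points, sigma=1.5)
goal = argmax_cell(vmap)
```

With well-separated points, the selected cell coincides with the highest-scored location; when points lie close together, their Gaussians overlap and the maximum can shift towards the region between them, which is the practical effect of fusing scores spatially rather than picking the top score outright.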