TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

Linqing Zhong; Chen Gao; Zihan Ding; Yue Liao; Huimin Ma; Shifeng Zhang; Xu Zhou; Si Liu

TL;DR

This paper addresses Zero-Shot Object Navigation (ZSON) by having a Multimodal LLM (MLLM) reason directly on a top-view map, which preserves spatial information that is lost when visual input is translated into natural language. It introduces three core components: Adaptive Visual Prompt Generation (AVPG), which builds a semantically rich top-view map; Dynamic Map Scaling (DMS), which zooms the map for local reasoning; and Potential Target Driven (PTD), which predicts target locations and guides exploration through a Gaussian fusion framework. A local policy converts this high-level guidance into low-level actions. Experiments on MP3D and HM3D report state-of-the-art performance, with gains in success rate (SR) and success weighted by path length (SPL) over prior methods, including training-free baselines. The results indicate that top-view spatial information supports robust, human-like exploration and object discovery in unseen environments.
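The summary above does not spell out how the Gaussian fusion works, so the following is only a minimal sketch of one plausible reading: each predicted target location contributes a Gaussian bump to a value map, the bumps are combined by a weighted sum, and the highest-valued cell becomes the exploration goal. The function names, the weighted-sum rule, the grid size, and the `sigma` value are all assumptions made for illustration and are not taken from the paper.

```python
import math


def gaussian_value_map(height, width, center, sigma):
    """Return a height x width grid whose values peak at `center` (row, col)."""
    cr, cc = center
    return [
        [
            math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2.0 * sigma ** 2))
            for c in range(width)
        ]
        for r in range(height)
    ]


def fuse(value_maps, weights):
    """Weighted sum of same-sized value maps (an assumed fusion rule)."""
    height, width = len(value_maps[0]), len(value_maps[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for vmap, weight in zip(value_maps, weights):
        for r in range(height):
            for c in range(width):
                fused[r][c] += weight * vmap[r][c]
    return fused


def best_cell(fused):
    """Return the (row, col) of the cell with the highest fused value."""
    cells = ((r, c) for r in range(len(fused)) for c in range(len(fused[0])))
    return max(cells, key=lambda rc: fused[rc[0]][rc[1]])


# Hypothetical example: two predicted target locations on a 20 x 20 map,
# the first trusted twice as much as the second.
maps = [
    gaussian_value_map(20, 20, (5, 5), 2.0),
    gaussian_value_map(20, 20, (15, 12), 2.0),
]
goal = best_cell(fuse(maps, [1.0, 0.5]))
```

In this toy setup the two bumps barely overlap, so the goal falls on the center of the more heavily weighted prediction. A local policy would then plan low-level actions toward that cell.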

Abstract

The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such goal-oriented exploration relies heavily on the ability to perceive, understand, and reason about the spatial information of the environment. However, current LLM-based approaches convert visual observations into language descriptions and reason in the linguistic space, which loses spatial information. In this paper, we introduce TopV-Nav, an MLLM-based method that reasons directly on the top-view map, where sufficient spatial information is retained. To fully unlock the MLLM's spatial reasoning potential from the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method, which adaptively constructs a semantically rich top-view map. It enables the agent to directly use the spatial information contained in the top-view map to conduct thorough reasoning. In addition, we design a Dynamic Map Scaling (DMS) mechanism that dynamically zooms the top-view map to preferred scales, enhancing local fine-grained reasoning. We also devise a Potential Target Driven (PTD) mechanism that predicts and uses potential target locations, facilitating global, human-like exploration. Experiments on the MP3D and HM3D datasets demonstrate the superiority of TopV-Nav.
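To make the idea of zooming the top-view map concrete, here is a small sketch of one way such scaling could be done: crop a square window around a point of interest and enlarge it by nearest-neighbour repetition. This is an illustrative assumption rather than the DMS mechanism as published. The function `zoom_map`, its parameters, and the zero padding for cells outside the map are all hypothetical.

```python
def zoom_map(grid, center, half_size, factor):
    """Crop a square window around `center` and enlarge it by `factor`.

    Enlargement repeats each cell `factor` times in both directions
    (nearest-neighbour). Cells outside the map are padded with 0.
    """
    cr, cc = center
    rows, cols = len(grid), len(grid[0])

    # Step 1: crop a (2 * half_size + 1) square window, padding with 0.
    crop = []
    for r in range(cr - half_size, cr + half_size + 1):
        row = []
        for c in range(cc - half_size, cc + half_size + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(grid[r][c] if inside else 0)
        crop.append(row)

    # Step 2: repeat every cell `factor` times horizontally and vertically.
    zoomed = []
    for row in crop:
        wide = [value for value in row for _ in range(factor)]
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed


# Hypothetical 5 x 5 map whose cells hold their own flat index.
top_view = [[r * 5 + c for c in range(5)] for r in range(5)]
local_view = zoom_map(top_view, center=(2, 2), half_size=1, factor=2)
```

Here `local_view` is a 6 x 6 grid covering the 3 x 3 neighbourhood of the center cell. In a real pipeline the grid would be a rendered map image, and the scale would be chosen by the reasoning model rather than fixed by hand.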
