PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory

Qunchao Jin, Yilin Wu, Changhao Chen

TL;DR

PanoNav tackles RGB-only, mapless zero-shot object navigation by introducing Panoramic Scene Parsing, which extracts fine-grained local and global spatial cues from six directional RGB views paired with dot matrix inputs. It further introduces a Dynamic Bounded Memory Queue that incorporates exploration history, guiding the LLM-based decision-maker to avoid local deadlocks and explore more efficiently. On the HM3D benchmark, PanoNav surpasses state-of-the-art baselines in Success Rate (SR) and Success weighted by Path Length (SPL) under mapless, open-vocabulary settings, validating both the perceptual parsing and the memory-guided decision components. The results indicate that rich panoramic parsing combined with historical context enables robust open-vocabulary navigation from RGB inputs alone, with practical implications for hardware-efficient household robots.

Abstract

Zero-shot object navigation (ZSON) in unseen environments remains a challenging problem for household robots, requiring strong perceptual understanding and decision-making capabilities. While recent methods leverage metric maps and Large Language Models (LLMs), they often depend on depth sensors or prebuilt maps, limiting the spatial reasoning ability of Multimodal Large Language Models (MLLMs). Mapless ZSON approaches have emerged to address this, but they typically make short-sighted decisions, leading to local deadlocks due to a lack of historical context. We propose PanoNav, a fully RGB-only, mapless ZSON framework that integrates a Panoramic Scene Parsing module to unlock the spatial parsing potential of MLLMs from panoramic RGB inputs, and a Memory-guided Decision-Making mechanism enhanced by a Dynamic Bounded Memory Queue to incorporate exploration history and avoid local deadlocks. Experiments on the public navigation benchmark show that PanoNav significantly outperforms representative baselines in both SR and SPL metrics.
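
The abstract reports results in SR and SPL without defining them. The sketch below assumes the definitions commonly used in embodied navigation benchmarks: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest-path length to the length of the path actually taken. The function and argument names are illustrative and do not come from the paper.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, assuming the common definition:
    mean over episodes of S_i * l_i / max(p_i, l_i), where l_i is the
    shortest-path length (assumed > 0) and p_i is the agent's path length.
    """
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            # A failed episode contributes 0; a success contributes at most 1.
            total += shortest / max(taken, shortest)
    return total / len(successes)
```

Under this definition, an agent that succeeds but takes twice the shortest path earns 0.5 for that episode, so SPL rewards efficiency as well as success.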

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the proposed PanoNav framework. At each timestep, the robot captures six directional RGB images to form a panoramic view. Each image is preprocessed into a dot matrix map, and both the RGB and dot matrix images are fed into an MLLM for spatial parsing. The model outputs local directional descriptions and a global scene summary. Global summaries are stored in a Dynamic Bounded Memory Queue and, together with the local and global information, are used by the LLM to make navigation decisions that guide the robot’s movement.
  • Figure 2: Spatial relationship parsing in Panoramic Scene Parsing. The MLLM processes both the original RGB image and its corresponding dot matrix image to extract geometric distance and planar positional relationships between objects, producing textual descriptions of the spatial scene.
  • Figure 3: Workflow of the Dynamic Memory-Guided Decision mechanism. Without memory, the LLM may cause the robot to revisit previously explored areas. Incorporating a memory mechanism guides the LLM to improve exploration and avoid redundant navigation.
  • Figure 4: Top-down views of the navigation trajectories for six representative examples with different object types. Cases (a)-(d) show smooth navigation scenarios, while cases (e) and (f) show more complex situations in which the robot initially circles within a local space before escaping and ultimately reaching the target.
  • Figure 5: Visualization Results of the Deadlock Avoidance Test. (a) Memory-less decision (b) Memory-guided decision.
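
The memory mechanism described in the captions for Figures 1 and 3 can be sketched as a bounded queue of global scene summaries that is rendered into the decision prompt. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name, the default capacity, the prompt wording, and the plain first-in-first-out eviction are all hypothetical, and the excerpt does not say how the queue adapts dynamically.

```python
from collections import deque
from typing import Dict


class BoundedMemoryQueue:
    """Hypothetical fixed-capacity store of per-step global scene summaries."""

    def __init__(self, capacity: int = 5):  # capacity value is an assumption
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # deque(maxlen=...) discards the oldest entry once the queue is full.
        self._items = deque(maxlen=capacity)

    def push(self, step: int, summary: str) -> None:
        """Record the global summary produced at a given timestep."""
        self._items.append((step, summary))

    def as_prompt_context(self) -> str:
        """Render the stored history, oldest first, as plain text."""
        if not self._items:
            return "No exploration history yet."
        return "\n".join(f"[step {s}] {text}" for s, text in self._items)

    def __len__(self) -> int:
        return len(self._items)


def build_decision_prompt(
    goal: str,
    local_descriptions: Dict[str, str],
    global_summary: str,
    memory: BoundedMemoryQueue,
) -> str:
    """Assemble an illustrative text prompt for an LLM decision-maker."""
    directions = "\n".join(
        f"- {name}: {text}" for name, text in local_descriptions.items()
    )
    return (
        f"Target object: {goal}\n"
        f"Current global summary: {global_summary}\n"
        f"Directional descriptions:\n{directions}\n"
        f"Exploration history (oldest first):\n{memory.as_prompt_context()}\n"
        "Choose one direction and avoid areas already described in the history."
    )
```

Bounding the queue keeps the prompt length roughly constant as the episode grows, while still exposing enough recent history for the decision-maker to notice when it is about to revisit an explored area.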