ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation

Feng Wu, Wei Zuo, Wenliang Yang, Jun Xiao, Yang Liu, Xinhua Zeng

Abstract

Zero-shot object navigation requires agents to locate unseen target objects in unfamiliar environments without prior maps or task-specific training, which remains a significant challenge. Although recent advances in vision-language models (VLMs) provide promising commonsense reasoning capabilities for this task, these models still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control. To address these issues, we propose a novel hierarchical navigation framework named ReMemNav, which seamlessly integrates panoramic semantic priors and episodic memory with VLMs. We introduce the Recognize Anything Model (RAM) to anchor the spatial reasoning process of the VLM. We also design an adaptive dual-modal rethinking mechanism based on an episodic semantic buffer queue, which actively verifies target visibility and corrects decisions using historical memory to prevent deadlocks. For low-level action execution, ReMemNav extracts a sequence of feasible actions using depth masks, allowing the VLM to select the optimal action, which is then mapped to actual spatial movement. Extensive evaluations on HM3D and MP3D demonstrate that ReMemNav outperforms existing training-free zero-shot baselines in both success rate and exploration efficiency. Specifically, we achieve significant absolute improvements, with success rate (SR) and success weighted by path length (SPL) increasing by 1.7% and 7.0% on HM3D v0.1, 18.2% and 11.1% on HM3D v0.2, and 8.7% and 7.9% on MP3D.
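
As a rough illustration of the per-step pipeline described above, the following Python sketch strings the pieces together: panoramic RGB-D observation, RAM-based semantic priors, a VLM direction decision checked against episodic memory, and depth-filtered action selection. All object and method names here are hypothetical placeholders under our reading of the abstract, not the authors' code or API.

```python
def navigation_step(agent, target, vlm, ram, memory):
    """Schematic per-step decision loop; every call below is a hypothetical
    placeholder for a framework component, not an actual ReMemNav API."""
    rgb_views, depth_views = agent.get_panoramic_rgbd()            # six-directional RGB-D observation
    priors = ram.extract_semantic_priors(rgb_views)                 # RAM tags anchor the VLM's spatial reasoning
    direction = vlm.choose_direction(target, rgb_views, priors,
                                     memory.as_prompt_context())    # high-level direction decision
    direction = vlm.rethink(direction, target, memory)              # verify target visibility, correct with episodic memory
    candidates = agent.feasible_actions(depth_views, direction)     # depth masks yield a sequence of feasible actions
    action = vlm.select_action(candidates)                          # VLM picks the action to execute
    memory.push(agent.position(), priors)                           # extend the episodic record with (p_t, d_t)
    return action
```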

Paper Structure

This paper contains 17 sections, 9 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 2: Navigation with our rethinking and memory-augmented framework.
  • Figure 3: Overview of the ReMemNav framework. At each time step, the agent acquires six-directional RGB-D observations, extracts semantic priors via RAM, and leverages a VLM for direction decision-making. The episodic memory buffer queue and adaptive dual-modal rethinking mechanism jointly prevent local deadlocks and false-positive stops.
  • Figure 4: Pipeline of the Episodic Memory Buffer Queue construction based on multi-modal perception. The figure illustrates how the system takes the current Panoramic Image and Semantic Prior Dictionary as inputs to generate a response via a VLM. The system then pushes the current Position ($p_t$) and Description ($d_t$) into a time-evolving Memory Buffer, forming a continuous episodic record from time $t-K+1$ to $t$ (see the code sketch after this list).
  • Figure 5: VLM-guided process for safe action decision-making.
  • Figure 6: Ablation study on episodic memory capacity $K$. The experiment is conducted using Qwen3-VL-4B on HM3D v0.2. The dual y-axis plot illustrates an inverted U-shaped trend for both SR and SPL, peaking at $K=10$.
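
To make the memory structure of Figure 4 concrete, below is a minimal Python sketch of a fixed-capacity episodic buffer that keeps the last $K$ (position, description) pairs; the class and method names are hypothetical illustrations under our reading of the figure, not the authors' implementation.

```python
from collections import deque


class EpisodicMemoryBuffer:
    """Minimal sketch of a fixed-capacity episodic buffer (hypothetical names).

    Keeps only the most recent K (position, description) pairs, so the stored
    record always spans time steps t-K+1 to t, as in Figure 4.
    """

    def __init__(self, capacity: int = 10):   # K = 10 is the peak in Figure 6's ablation
        self.buffer = deque(maxlen=capacity)  # oldest entry is dropped automatically when full

    def push(self, position, description):
        # Push the current step's position p_t and VLM-generated description d_t.
        self.buffer.append((position, description))

    def as_prompt_context(self):
        # Serialize the episodic record so it can be fed back into the VLM prompt.
        return "\n".join(
            f"step t-{len(self.buffer) - 1 - i}: position={p}, seen={d}"
            for i, (p, d) in enumerate(self.buffer)
        )
```

In such a setup, one would call push at every step and prepend as_prompt_context() to the VLM query, giving a rethinking step access to where the agent has already been and what it observed there.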