HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System

Kailin Lyu, Kangyi Wu, Pengna Li, Xiuyu Hu, Qingyi Si, Cui Miao, Ning Yang, Zihang Wang, Long Xiao, Lianyu Hu, Jingyuan Sun, Ce Hao

Abstract

LLM-based agents have demonstrated impressive zero-shot performance in vision-and-language navigation (VLN) tasks. However, most zero-shot methods rely on closed-source LLMs as navigators, which incur high token costs and pose potential data leakage risks. Recent efforts have attempted to address this by combining open-source LLMs with a spatiotemporal chain-of-thought (CoT) framework, but they still fall far short of closed-source models. In this work, through a detailed analysis of the navigation process, we identify a critical issue, Navigation Amnesia, which leads to navigation failures and widens the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent's navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.

Paper Structure (16 sections, 5 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: (a) Comparison between a GPT-based navigator and an open-source LLM-based navigator. (b) The phenomenon of "Navigation Amnesia". Short-term amnesia leads the agent to enter local loops or perform redundant exploration, whereas long-term amnesia causes it to forget global instruction constraints, making its decisions drift from the intended route.
  • Figure 2: Overview of HiMemVLN. Built upon an MLLM, HiMemVLN integrates short-term and long-term memory systems to accomplish task execution through a closed-loop memory-reasoning-execution process.
  • Figure 3: Workflow of the hierarchical memory system. The visually driven Short-Term Localer mimics human spatial reasoning to detect revisits and reduce redundant exploration. The semantically driven Long-Term Globaler mirrors human global reflection to preserve origin awareness and directional alignment, ensuring long-horizon consistency.
  • Figure 4: The Go2W robot and the arrangement of the real-world environment.
  • Figure 5: Qualitative results of OpenNav and HiMemVLN in simulation environments. In contrast to the state-of-the-art open-source method, OpenNav, which exhibits short-term and long-term amnesia (the looping and deviating behaviors in the red dashed boxes), our method eliminates navigation amnesia and accurately follows the given instructions.
  • ...and 2 more figures