
MemoNav: Working Memory Model for Visual Navigation

Hongxin Li, Zeyu Wang, Xu Yang, Yuran Yang, Shuqi Mei, Zhaoxiang Zhang

TL;DR

MemoNav is a novel memory model for image-goal navigation that uses a working-memory-inspired pipeline to improve navigation performance. It significantly outperforms previous methods across all difficulty levels in both Gibson and Matterport3D scenes.

Abstract

Image-goal navigation is a challenging task that requires an agent to navigate to a goal indicated by an image in unfamiliar environments. Existing methods utilizing diverse scene memories suffer from inefficient exploration since they use all historical observations for decision-making without considering the goal-relevant fraction. To address this limitation, we present MemoNav, a novel memory model for image-goal navigation, which utilizes a working memory-inspired pipeline to improve navigation performance. Specifically, we employ three types of navigation memory. The node features on a map are stored in the short-term memory (STM), as these features are dynamically updated. A forgetting module then retains the informative STM fraction to increase efficiency. We also introduce long-term memory (LTM) to learn global scene representations by progressively aggregating STM features. Subsequently, a graph attention module encodes the retained STM and the LTM to generate working memory (WM), which contains the scene features essential for efficient navigation. The synergy among these three memory types boosts navigation performance by enabling the agent to learn and leverage goal-relevant scene features within a topological map. Our evaluation on multi-goal tasks demonstrates that MemoNav significantly outperforms previous methods across all difficulty levels in both Gibson and Matterport3D scenes. Qualitative results further illustrate that MemoNav plans more efficient routes.
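
The flow from STM and LTM to WM described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`mean_vector`, `attention_encode`, `build_working_memory`) are hypothetical, the LTM is approximated by a plain mean over all STM features, and the graph attention module is replaced by a single round of unparameterized scaled dot-product attention over a fully connected set of nodes.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_encode(nodes):
    """One round of scaled dot-product self-attention with no learned weights.

    Assumption: every node attends to every other node; the paper's graph
    attention module operates on the topological map instead.
    """
    dim = len(nodes[0])
    encoded = []
    for query in nodes:
        logits = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in nodes
        ]
        weights = softmax(logits)
        encoded.append(
            [sum(w * node[d] for w, node in zip(weights, nodes)) for d in range(dim)]
        )
    return encoded


def build_working_memory(stm, keep_mask):
    """Combine the retained STM with an LTM vector and encode them into WM."""
    # Assumption: the LTM is a simple mean of all STM node features.
    ltm = mean_vector(stm)
    # Only the STM nodes kept by the forgetting step enter the encoder.
    retained = [feat for feat, keep in zip(stm, keep_mask) if keep]
    return attention_encode(retained + [ltm])


# Toy example: three 2-D node features, the second one temporarily forgotten.
stm = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
wm = build_working_memory(stm, keep_mask=[True, False, True])
print(len(wm), "working-memory vectors of size", len(wm[0]))
```

In this sketch the WM has one vector per retained STM node plus one for the LTM, so forgetting directly shrinks the input that later decoding stages would have to process.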

Paper Structure (28 sections, 1 equation, 14 figures, 3 tables)

Figures (14)

  • Figure 1: A brief example of MemoNav. MemoNav calculates attention scores for each node on the topological map and then excludes the nodes with low scores (the black nodes in the figure) during decision-making. This design helps our agent focus more on goal-relevant scene features, boosting multi-goal visual navigation performance.
  • Figure 2: Overview of MemoNav. (a) The memory update module builds a topological map using ${\bm{e}}_t$, the embedding of the current image ${\bm{\mathsfit{I}}}_t$. (b) The node features in the map constitute the STM while a global node that links to each node acts as the LTM. (c) The forgetting module temporarily excludes a fraction of STM whose attention scores rank below a threshold $p$. (d) The retained STM and the LTM are concatenated and then encoded by (e) a graph attention module to generate the WM ${\bm{M}}_w^t$. (f) The WM is decoded by two Transformer decoders (details in \ref{fig:detailed_memonav}). (g) Lastly, the output of the decoding process is input to a policy network to generate navigation actions.
  • Figure 3: An example episode for multi-goal tasks in Gibson. The agent is tasked with navigating to multiple sequential goals.
  • Figure 4: Navigation performance versus forgetting threshold $p$ in the Gibson scenes. MemoNav achieves the best performance on easier tasks with a lower $p$, whereas a higher $p$ is more beneficial for harder tasks. Moreover, MemoNav maintains high SR/PR with just 20% of STM on the 3-goal tasks and benefits from a higher $p$ on the 4-goal tasks.
  • Figure 5: Visualization of example episodes from a top-down view. We compare CNNLSTM, VGM, and MemoNav at four difficulty levels in the Gibson scenes. MemoNav plans more efficient paths than the other two methods. For instance, in the 3-goal example, MemoNav quickly reaches the third goal, which is located in an already explored area. Best viewed in color.
  • ...and 9 more figures
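
The forgetting step mentioned in the captions of Figures 1, 2(c), and 4 can be sketched as a rank-based mask. This is an illustration under stated assumptions, not the paper's code: the function name `forget_low_score_nodes` is hypothetical, $p$ is read as the fraction of lowest-scoring nodes to exclude, the number of excluded nodes is rounded down, and ties are broken by node index.

```python
import math


def forget_low_score_nodes(scores, p):
    """Return a keep-mask that drops the fraction p of nodes with the lowest scores.

    scores: one attention score per STM node (higher means more goal-relevant).
    p: fraction of nodes to exclude, assumed to lie in [0, 1).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    n = len(scores)
    num_forget = math.floor(p * n)  # assumption: round down
    # Rank node indices from lowest to highest score (ties broken by index).
    order = sorted(range(n), key=lambda i: (scores[i], i))
    forgotten = set(order[:num_forget])
    return [i not in forgotten for i in range(n)]


# Toy example: five nodes, exclude the lowest-scoring 40%.
scores = [0.30, 0.05, 0.25, 0.10, 0.30]
keep = forget_low_score_nodes(scores, p=0.4)
print([i for i, k in enumerate(keep) if k])  # [0, 2, 4]
```

Because the mask is recomputed from the current scores, a node excluded at one step can be retained again at a later step, which matches the caption's description of the exclusion as temporary.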