Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation

Yiyuan Pan, Yunzhe Xu, Zhe Liu, Hesheng Wang

TL;DR

The paper addresses Vision-and-Language Navigation (VLN) in unseen environments by introducing SALI, a navigation agent that combines episodic memory with episodic simulation through a reality-imagination hybrid memory. SALI maintains a topological memory map that stores both real observations and imagined content, and uses a recurrent imagination tree to generate high-fidelity future views; a multimodal transformer fuses these inputs to inform actions. Key contributions include the Real-Imaginary Hybrid Memory with dynamic action planning, the Recurrent Imagination Tree for scalable future prediction, and pre-training and cross-correction strategies that yield state-of-the-art results on the R2R and REVERIE benchmarks. The approach improves navigation robustness and efficiency in complex, unseen environments, demonstrating the practical value of integrating imaginative content with long-term memory for embodied AI.

Abstract

Humans navigate unfamiliar environments using episodic simulation and episodic memory, which facilitate a deeper understanding of the complex relationships between environments and objects. Developing an imaginative memory system inspired by human mechanisms can enhance the navigation performance of embodied agents in unseen environments. However, existing Vision-and-Language Navigation (VLN) agents lack a memory mechanism of this kind. To address this, we propose a novel architecture that equips agents with a reality-imagination hybrid memory system. This system enables agents to maintain and expand their memory through both imaginative mechanisms and navigation actions. Additionally, we design tailored pre-training tasks to develop the agent's imaginative capabilities. Our agent can imagine high-fidelity RGB images for future scenes, achieving state-of-the-art results in Success rate weighted by Path Length (SPL).
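
The abstract reports results in SPL without defining it. As a rough illustration, the sketch below computes SPL using the commonly used definition from the embodied-navigation literature: the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest-path length, and p_i the length of the path the agent actually took. The function name `spl` and its argument names are hypothetical and chosen for this example; this is not the paper's evaluation code.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    taken_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Each episode contributes l / max(p, l) if it succeeded and 0 otherwise,
    where l is the shortest-path length and p is the agent's path length.
    """
    if not (len(successes) == len(shortest_lengths) == len(taken_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, taken_lengths):
        if not success:
            continue  # failed episodes contribute zero
        denom = max(taken, shortest)
        # Assumption for this sketch: a zero-length episode that succeeds
        # (start == goal) counts as perfectly efficient.
        total += shortest / denom if denom > 0 else 1.0
    return total / len(successes)


# Three episodes: an optimal success, a success on a path twice as long
# as necessary, and a failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 10.0])
print(score)  # 0.5
```

In this example the three episodes contribute 1.0, 0.5, and 0.0, so the plain success rate would be 2/3 while SPL is 0.5; the metric penalizes successful but inefficient trajectories.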

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Humans utilize episodic memory and episodic simulation to recall past experiences and predict future outcomes in unfamiliar environments. In contrast, navigation agents often struggle in unseen environments due to their inability to construct and leverage such cognitive frameworks.
  • Figure 2: We propose building a hybrid imagination-reality memory for long-term navigation decisions. Based on the navigation observation and trajectories, the agent will imagine future scenes for unvisited environments. The imagination will then be fused into its hybrid memory to aid further decision-making. The figure also illustrates a series of pre-training tasks that we propose for the imagination module.
  • Figure 3: The imagination model includes four pre-trained models (inpaint model, spade model, room-type model, and waypoint model). We propose an end-to-end architecture that allows local imaginary trees to continuously update and maintain themselves, producing high-quality depth images, semantic images, and RGB images.
  • Figure 4: Before and after adding the room-type model.
  • Figure 5: Comparison of SR and SPL between SALI and BEVBert models on different sub-datasets.
  • ...and 1 more figure