ESceme: Vision-and-Language Navigation with Episodic Scene Memory

Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, Dacheng Tao

TL;DR

This work presents Episodic Scene memory (ESceme) for vision-and-language navigation, a memory-based mechanism that stores and reuses scene-level information across episodes to improve decision-making without extra annotations or heavy computation. ESceme uses a per-scene episodic memory graph and a candidate-enhancing module to fuse memory with current observations, enabling an agent to envision a broader context during navigation. The approach, trained with a hybrid imitation-reinforcement objective, achieves state-of-the-art results on R2R, R4R, and CVDN, notably improving unseen performance and long-horizon navigation while maintaining efficiency. The findings suggest episodic memory as a powerful direction for robust VLN, with practical impact for real-world embodied agents that must operate across multiple visits to the same environments.

Abstract

Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches, such as beam search, pre-exploration, and dynamic or hierarchical history encoding, have made enormous progress in navigating new environments. To balance generalization and efficiency, we resort to memorizing visited scenarios apart from the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture for the next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. Our ESceme also wins first place on the CVDN leaderboard. Code is available at https://github.com/qizhust/esceme.

Paper Structure (14 sections, 7 equations, 12 figures, 13 tables)

Figures (12)

  • Figure 1: The blue trajectory shows an agent carrying out instruction 1. The next time, the agent enters this scene to carry out the second instruction along the red path. ESceme allows it to recall the visited nodes (i.e., the blue ones) from where it is standing (A) and choose the neighboring node B$_1$, which leads to “the white bookshelf” in one more step at C. Finally, it follows the dashed red route and reaches the target.
  • Figure 2: An overview of the Episodic Scene memory mechanism for VLN. On the left is the partial episodic memory for the current scene, which gets updated during navigation 1) following the previous instruction, i.e., the blue route, and 2) following the current instruction from Step 1 to $t-1$, i.e., the solid red trajectory. The cyan nodes are those viewed but not visited. The shadow box shows the memory of node B$_1$, which has six adjacent neighbors, i.e., A, B$_2$, B$_5$, C, D, and E. The integration of these nodes constitutes the memory of B$_1$. At Step $t$, the agent stands at Node A and is expected to choose one node from B$_1$ to B$_5$. Given observations from $K$ views, each view retrieves its memory in ESceme and produces $\{\mathbf{m}_1,...,\mathbf{m}_K\}$. The memory representation is then fused with the original encoded observations, which yields $\{\mathbf{o}_1,...,\mathbf{o}_K,\mathbf{o}_s\}$, where $\mathbf{o}_s$ is the representation for STOP. The enhanced observations, instruction text, and history from Step 1 to $t-1$ compose the input to a navigation network that predicts the action $a_t=i\in \{1,...,K,s\}$. Generally, a navigation network uses the encoded features of the original $K$ views as the input to the cross-modal encoder, i.e., the output ①. Our ESceme exploits the enhanced observations from ②.
  • Figure 3: Episodic memory construction of a scene during navigation. The figures show ESceme at the beginning of each time step; it comprises the green nodes and edges and is empty at the beginning of $t=1$. The blue nodes indicate the current location at each time step while following the first instruction, and the red ones correspond to the second instruction. The small cyan nodes mark the remaining navigable viewpoints of the current location. Nodes with a green boundary are the chosen viewpoints in each time step. At the end of a time step, ESceme is updated with the green-boundary node and the dashed lines connecting it to existing nodes. Please refer to Fig. \ref{fig:demo} for a complete global graph of the scene, which is unavailable to the agent during both navigation and ESceme construction.
  • Figure 4: Navigation quality w.r.t. inference progress. The x-axis indicates the ratio of samples tested, and the y-axis is the smoothed average of SPL or CLS. We use the default sample order for all the methods. Navigation with ESceme improves over time.
  • Figure 5: An overview of ESceme-assisted navigation by graph encoding. First, episodic memory is built in the same way as that for candidate enhancing (cf. Section 3.2). Then, the agent searches the episodic memory for the current viewpoint and obtains the memory graph by masking a local window. The encoded memory composes a separate branch to the cross-modal encoder.
  • ...and 7 more figures
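
The captions of Figures 2 and 3 describe a per-scene memory graph that grows as the agent moves and is queried to enhance each candidate view. The following is a minimal Python sketch of that idea, not the authors' implementation. The class name `EpisodicSceneMemory`, the function `enhance`, mean pooling over a node and its known neighbors, and averaging as the fusion step are all assumptions made for illustration; the released code may pool and fuse differently.

```python
from collections import defaultdict


class EpisodicSceneMemory:
    """Toy per-scene memory: viewpoints seen so far, with one feature per node.

    Hypothetical illustration only; names and pooling are assumptions.
    """

    def __init__(self):
        self.features = {}                 # node id -> feature vector (list of floats)
        self.neighbors = defaultdict(set)  # node id -> ids of known adjacent nodes

    def update(self, current, current_feat, navigable):
        """Record the current viewpoint and the navigable viewpoints seen from it.

        navigable maps a neighbor id to the feature of the view towards it.
        """
        self.features[current] = current_feat
        for node, feat in navigable.items():
            # Keep the earlier feature if this node is already in memory.
            self.features.setdefault(node, feat)
            self.neighbors[current].add(node)
            self.neighbors[node].add(current)

    def retrieve(self, node, dim):
        """Mean-pool the features of `node` and its known neighbors.

        Returns a zero vector when the node has never been seen (assumption).
        """
        ids = [n for n in {node, *self.neighbors.get(node, ())} if n in self.features]
        if not ids:
            return [0.0] * dim
        return [sum(self.features[n][d] for n in ids) / len(ids) for d in range(dim)]


def enhance(observation, memory):
    """Fuse an encoded view with its memory by simple averaging (assumption)."""
    return [(o + m) / 2 for o, m in zip(observation, memory)]


# Usage: two steps in one scene, then enhance the candidate view towards B1.
scene = EpisodicSceneMemory()
scene.update("A", [1.0, 0.0], {"B1": [0.0, 1.0], "B2": [1.0, 1.0]})
scene.update("B1", [0.0, 1.0], {"A": [1.0, 0.0], "C": [2.0, 2.0]})
m_b1 = scene.retrieve("B1", dim=2)   # pools B1 with its known neighbors A and C
o_b1 = enhance([0.0, 1.0], m_b1)     # enhanced candidate representation
```

In this sketch the memory persists across instructions in the same scene, so a later episode that starts at A can already retrieve what was stored about B1 and C during an earlier one.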