ESceme: Vision-and-Language Navigation with Episodic Scene Memory
Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, Dacheng Tao
TL;DR
This work presents Episodic Scene memory (ESceme) for vision-and-language navigation, a memory-based mechanism that stores and reuses scene-level information across episodes to improve decision-making without extra annotations or heavy computation. ESceme uses a per-scene episodic memory graph and a candidate-enhancing module to fuse memory with current observations, enabling an agent to envision a broader context during navigation. The approach, trained with a hybrid imitation-reinforcement objective, achieves state-of-the-art results on R2R, R4R, and CVDN, notably improving unseen performance and long-horizon navigation while maintaining efficiency. The findings suggest episodic memory as a powerful direction for robust VLN, with practical impact for real-world embodied agents that must operate across multiple visits to the same environments.
Abstract
Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches, such as beam search, pre-exploration, and dynamic or hierarchical history encoding, have made enormous progress in navigating new environments. To balance generalization and efficiency, we resort to memorizing visited scenarios apart from the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture for the next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. ESceme also wins first place on the CVDN leaderboard. Code is available at https://github.com/qizhust/esceme.
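The idea of enhancing accessible views from a progressively completed memory can be illustrated with a minimal sketch. It is not the released implementation: the class `SceneMemory`, its methods `update` and `enhance`, and the choice of mean pooling as the fusion step are all assumptions made for illustration, and features are plain lists of floats rather than learned view embeddings.

```python
from collections import defaultdict


class SceneMemory:
    """Hypothetical per-scene memory: a graph of viewpoints seen so far."""

    def __init__(self):
        self.features = {}                 # viewpoint id -> feature vector (list of floats)
        self.neighbors = defaultdict(set)  # viewpoint id -> adjacent viewpoint ids

    def update(self, viewpoint, feature, adjacent):
        """Record an observation and the viewpoints reachable from it."""
        self.features[viewpoint] = list(feature)
        for other in adjacent:
            self.neighbors[viewpoint].add(other)
            self.neighbors[other].add(viewpoint)

    def enhance(self, candidate, feature):
        """Mean-pool the candidate's current feature with remembered neighbor features.

        Mean pooling is an assumed stand-in for the fusion step.
        """
        remembered = [
            self.features[n]
            for n in sorted(self.neighbors.get(candidate, ()))
            if n in self.features
        ]
        vectors = [list(feature)] + remembered
        return [sum(column) / len(vectors) for column in zip(*vectors)]


# One memory per scene, kept across episodes (scene ids are hypothetical).
memories = defaultdict(SceneMemory)
memory = memories["scene_0"]
memory.update("a", [1.0, 0.0], adjacent=["b"])

enhanced_b = memory.enhance("b", [0.0, 1.0])  # "b" borders the remembered "a"
enhanced_c = memory.enhance("c", [2.0, 4.0])  # "c" has no remembered neighbors
```

In this sketch a candidate with no remembered neighbors keeps its current feature unchanged, so the agent falls back to ordinary observation-only behavior on a first visit and benefits from memory only as the graph fills in.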
