Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation

Yunzhe Xu, Yiyuan Pan, Zhe Liu

TL;DR

Memoir tackles memory-persistent Vision-and-Language Navigation (VLN) by using imagined future states as queries into explicit long-term memory. It combines a language-conditioned world model, Hybrid Viewpoint-Level Memory, and an experience-augmented navigation model to adaptively retrieve both environmental observations and navigation histories. Across the IR2R and GSA-R2R benchmarks, Memoir delivers consistent SPL gains (5.4% on IR2R over the best memory-persistent baseline), together with an 8.3x training speedup and a 74% reduction in inference memory. The work also analyzes retrieval upper bounds (73.3% achieved vs a 93.4% upper bound) and failure modes, pointing to future improvements in world modeling and confidence-aware exploration.

Abstract

Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at https://github.com/xyz9911/Memoir.
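
The SPL figures quoted above refer to Success weighted by Path Length, a standard VLN metric that rewards reaching the goal along a path close to the shortest one. The sketch below shows how the metric is usually computed; the function name `spl` and its argument names are illustrative assumptions and are not taken from the Memoir codebase.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    For each episode i: S_i * l_i / max(p_i, l_i), where S_i is success,
    l_i is the shortest-path length and p_i is the length actually travelled.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, travelled in zip(successes, shortest_lengths, path_lengths):
        if not success:
            continue  # failed episodes contribute zero
        denom = max(travelled, shortest)
        # If start and goal coincide (both lengths zero), count full credit.
        total += 1.0 if denom == 0 else shortest / denom
    return total / len(successes)


if __name__ == "__main__":
    # Three toy episodes: optimal success, success with a detour, failure.
    print(spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 15.0]))
```

Because failed episodes score zero and detours are penalized proportionally, an SPL gain reflects improvements in success, path efficiency, or both.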

Paper Structure

This paper contains 22 sections, 20 equations, 6 figures, 13 tables, 3 algorithms.

Figures (6)

  • Figure 1: Overview of Memoir's workflow for experience retrieval via imagination. (a) In previous episodes (1 and 2), the agent populates the history bank with latent states encoded by the world model, and fills the observation bank with observations. (b) In the current episode (3), the agent uses world model imagination to generate retrieval queries and retrieves memory from both memory banks at each viewpoint for navigation planning. Compared with GR-DUET [2025gsa], which incorporates all retained observation memory, and OVER-NAV [2024overnav], which only applies fixed-horizon lookup, our approach adaptively retrieves both observations and histories for navigation planning through imagination.
  • Figure 2: Details of imagination-guided experience retrieval. (a) The world model learns state-observation compatibility through contrastive training (top). During navigation, it infers the current state from observations and instruction, then recursively imagines future states (bottom). (b) Imagined trajectories enable dual retrieval: histories via state sequence similarity matching, and observations via topological searching based on state-observation compatibility. (c) Three specialized encoders process retrieved navigation histories, local observations, and retrieved observations respectively to determine the final action.
  • Figure 3: Visualization of Memoir's memory retrieval from the environmental observation bank and the navigation history bank, together with the panoramic trajectory visualization. We compare the navigation results of DUET, GR-DUET, and ours. The goal location is indicated by a checkered flag.
  • Figure 4: Performance scaling across tour progression on IR2R.
  • Figure 5: Study of retrieval hyper-parameters on IR2R val-unseen. Left: Environmental observation retrieval. Right: Navigation history retrieval.
  • ...and 1 more figure
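
The history-retrieval step described in the Figure 2 caption (imagine a short trajectory of future states, then match it against stored state sequences) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the names `imagine`, `sequence_similarity`, and `retrieve_histories`, the use of mean step-wise cosine similarity, and the stand-in transition function. The paper's learned world model and its actual matching rule are not specified in this summary.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def imagine(state: Vector, step_fn: Callable[[Vector], Vector], horizon: int) -> List[Vector]:
    """Recursively roll a (hypothetical) transition function forward."""
    trajectory = []
    for _ in range(horizon):
        state = step_fn(state)
        trajectory.append(state)
    return trajectory


def sequence_similarity(query: List[Vector], stored: List[Vector]) -> float:
    """Mean step-wise cosine similarity over the overlapping prefix (assumed rule)."""
    steps = min(len(query), len(stored))
    if steps == 0:
        return 0.0
    return sum(cosine(q, s) for q, s in zip(query, stored)) / steps


def retrieve_histories(
    query: List[Vector],
    bank: List[Tuple[str, List[Vector]]],
    top_k: int,
) -> List[Tuple[float, str]]:
    """Score every stored state sequence against the imagined query; keep the top_k."""
    scored = [(sequence_similarity(query, seq), key) for key, seq in bank]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Stand-in "world model": rotate a 2-D latent state by 90 degrees per step.
    def rotate(v: Vector) -> Vector:
        return [-v[1], v[0]]

    query = imagine([1.0, 0.0], rotate, horizon=2)
    bank = [
        ("episode_1", [[0.0, 1.0], [-1.0, 0.0]]),
        ("episode_2", [[1.0, 0.0], [0.0, 1.0]]),
        ("episode_3", [[0.0, -1.0], [1.0, 0.0]]),
    ]
    for score, key in retrieve_histories(query, bank, top_k=2):
        print(key, round(score, 3))
```

The point of the sketch is the control flow: the query is a predicted sequence rather than the current observation, so retrieval is selective and depends on where the agent expects to go, in contrast to incorporating the entire memory or using a fixed-horizon lookup.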