Building spatial world models from sparse transitional episodic memories

Zizhan He, Maxime Daigle, Pouya Bashivan

TL;DR

This work addresses how a neural model can rapidly construct a spatial world model from sparse episodic memories. The Episodic Spatial World Model (ESWM) meta-trains across diverse environments to infer unseen transitions from minimal one-step memories, producing a latent space that maps closely to actual environment geometry. ESWM enables near-optimal exploration and navigation in novel spaces, supports fast adaptation to structural changes through memory edits, and offers planning capabilities via imagination and latent-heuristic search. The approach decouples memory and reasoning, allowing flexible inference and robust performance in obstacle-rich environments with limited data, which has implications for data-efficient autonomous navigation and cognitive-inspired spatial reasoning.

Abstract

Many animals possess a remarkable capacity to rapidly construct flexible mental models of their environments. These world models are crucial for ethologically relevant behaviors such as navigation, exploration, and planning. The ability to form episodic memories and make inferences based on these sparse experiences is believed to underpin the efficiency and adaptability of these models in the brain. Here, we ask: Can a neural network learn to construct a spatial model of its surroundings from sparse and disjoint episodic memories? We formulate the problem in a simulated world and propose a novel framework, the Episodic Spatial World Model (ESWM), as a potential answer. We show that ESWM is highly sample-efficient, requiring minimal observations to construct a robust representation of the environment. It is also inherently adaptive, allowing for rapid updates when the environment changes. In addition, we demonstrate that ESWM readily enables near-optimal strategies for exploring novel environments and navigating between arbitrary points, all without the need for additional training.

Paper Structure

This paper contains 36 sections, 1 equation, 14 figures, 4 algorithms.

Figures (14)

  • Figure 1: Episodic Spatial World Model. a) Three common scenarios that hinder the ability of typical world models to generalize effectively. (top) Observing the full environment may take many time steps, leading to long sequences; (middle) Specific parts of the environment may change dynamically; (bottom) Information about a particular environment may be collected across separate exposures to the environment rather than within a single one. b) Memory bank and query selection in a square grid environment. c) Architecture of ESWM and training procedure. Model input consists of a bank of transitional memories (corresponding to the black arrows in (b)) and a single query (q arrow in (b)), with either the start state, action, or end state randomly masked with equal probability. The sequence of transitions is processed by a sequence model (e.g., a Transformer encoder block), and the model parameters are updated to output the correct value for the masked component.
  • Figure 2: Evaluation accuracy in a) Open Arena and b) Random Wall. In both schematics, blue arrows are transitions in the memory bank and the $q$-labeled arrows are examples of query transitions $q=(s_s,a,s_e)$. In Open Arena, the states are represented as 6-bit binaries and the model is tested in environments entirely filled with unseen states, while in Random Wall, the states are integers and the model is tested on its ability to generalize to unseen wall patterns. In addition, a random subregion is masked in Random Wall and the query $q$ can be either unsolvable, unseen, or seen, depicted as $q$-labeled arrows with different colors. The Random Wall displayed here is a scaled-down version of what the model is trained on (19 locations versus 36 locations). XFMR is short for Transformer.
  • Figure 3: Spatial map emerges in ESWM's latent space. a), b) ISOMAP projections of the 2-layer transformer model's first layer's activations for action prediction. Different columns are the same projection from different viewing angles. a) Spatial map in the absence of obstacles and b) in the presence of obstacles (a straight wall). c), d) ISOMAP projections of the 14-layer transformer model's seventh layer's activations. c) From left to right: A sample room with two disjoint regions whose shapes match boundaries' shape; ESWM's latent space when a memory bank observing either the top or bottom region is given as input; ESWM's latent space when a memory bank observing both regions is given as input. d) From left to right: A sample room containing two disjoint regions whose shapes give no cues about their relative position; ESWM's latent space when a memory bank of the room is given as input; Updated room with a new wall, the regions remain disjoint; ESWM's latent space when a memory bank of the updated room is given as input. See Fig. \ref{fig:A1}, \ref{fig:A2}, \ref{fig:A3}, \ref{fig:A4}, \ref{fig:A5} for more examples.
  • Figure 4: ESWM integrates memories. a) The X-axis represents the shortest path length between the source and end states in the query, with edges corresponding to episodic memories in the memory bank, and the Y-axis is the entropy of ESWM's prediction probabilities (n=5000). b) KL divergence between ESWM prediction distributions before and after adding an episodic memory to the memory bank. An informative episodic memory shortens the integration path required for the model to solve the prediction task, while a non-informative episodic memory does not alter the shortest integration path (n=5000). c) Prediction accuracies for unseen transitions as ESWM receives larger, out-of-distribution memory banks (n=2000). Transformer-14L is used for a) and b) while 4L is used for c).
  • Figure 5: Exploration, navigation, and adaptability. a) Comparison of exploration strategies based on the number of unique states visited over 15 time steps in Random Wall. The optimal agent explores along the path found by the Traveling Salesman Algorithm over known free space (n=1000). b) Comparison of navigation success rate and path optimality over path lengths between EPN and ESWM (n=2400). c) Comparison of navigation success rates with an increasing number of unexpected obstacles. The baseline agent is trained on the original environment and tested after structural changes, while ESWM navigates with memory banks from the original environment and needs to autonomously adapt to changes (n=100). Random Wall experiments include 19 locations.
  • ...and 9 more figures
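
The input format described in the Figure 1c caption (a bank of one-step transitions plus a single query with one of its three components masked with equal probability) can be illustrated with a minimal Python sketch. This is not the authors' code: the `MASK` token, the `make_training_example` helper, the toy 2x2 grid, and the choice to hold the query out of the bank are all assumptions made for illustration only.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder for the hidden component


def make_training_example(transitions, rng):
    """Split transitions into a memory bank and one masked query.

    transitions: list of (start_state, action, end_state) tuples.
    Returns (memory_bank, masked_query, slot, target), where slot is
    0 (start state), 1 (action), or 2 (end state).
    """
    bank = list(transitions)
    # Assumption: the query is a transition held out of the bank.
    query = bank.pop(rng.randrange(len(bank)))
    # Mask the start state, action, or end state with equal probability.
    slot = rng.randrange(3)
    masked = list(query)
    target = masked[slot]
    masked[slot] = MASK
    return bank, tuple(masked), slot, target


# Toy 2x2 grid: states are (row, col) pairs, actions are direction names.
transitions = [
    ((0, 0), "right", (0, 1)),
    ((0, 1), "down", (1, 1)),
    ((1, 1), "left", (1, 0)),
    ((1, 0), "up", (0, 0)),
]

bank, masked_query, slot, target = make_training_example(
    transitions, random.Random(0)
)
print("memory bank:", bank)
print("masked query:", masked_query, "| slot:", slot, "| target:", target)
```

A sequence model would then receive the bank and the masked query as input and be trained to output `target`; that model is outside the scope of this sketch.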