On Memory: A comparison of memory mechanisms in world models

Eli J. Laird, Corey Clark

TL;DR

The paper tackles the memory bottleneck in transformer-based world models, which impedes long-horizon planning and loop closures. It develops a taxonomy separating memory encoding (how past information is compressed) from memory injection (how memory is reintroduced into the residual stream) and evaluates numerous encoding/injection combinations on MemoryMaze using a Vision Transformer backbone. Key findings show that memory augmentation improves recent-context recall and loop-closure potential, with cache-based memory and context-prepended injections delivering the strongest reconstruction and latent accuracy, while state-space memories offer a compact alternative. The work highlights trade-offs between memory capacity, stability, and computation, and points toward hybrid designs that combine encoding and injection strengths for longer horizons.

Abstract

World models enable agents to plan within imagined environments by predicting future states conditioned on past observations and actions. However, their ability to plan over long horizons is limited by the effective memory span of the backbone architecture. This limitation leads to perceptual drift in long rollouts, hindering the model's capacity to perform loop closures within imagined trajectories. In this work, we investigate the effective memory span of transformer-based world models through an analysis of several memory augmentation mechanisms. We introduce a taxonomy that distinguishes between memory encoding and memory injection mechanisms, motivating their roles in extending the world model's memory through the lens of residual stream dynamics. Using a state recall evaluation task, we measure the memory recall of each mechanism and analyze its respective trade-offs. Our findings show that memory mechanisms improve the effective memory span in vision transformers and provide a path to completing loop closures within a world model's imagination.
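The split between memory encoding and memory injection can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a cache-style encoder that keeps the most recent tokens evicted from a fixed-size context window, and an injection step that prepends those cached tokens to the current context. The names `CacheMemory`, `prepend_injection`, and `step`, as well as the `capacity` and `window` parameters, are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence

Token = List[float]  # one latent token as a plain list of floats (illustrative)


class CacheMemory:
    """Encoding (assumed form): keep the last `capacity` evicted tokens."""

    def __init__(self, capacity: int) -> None:
        # A bounded deque silently drops the oldest slot once it is full.
        self.slots: Deque[Token] = deque(maxlen=capacity)

    def write(self, evicted: Sequence[Token]) -> None:
        for tok in evicted:
            self.slots.append(list(tok))

    def read(self) -> List[Token]:
        return list(self.slots)


def prepend_injection(memory: List[Token], context: List[Token]) -> List[Token]:
    """Injection (assumed form): place memory tokens in front of the context."""
    return memory + context


def step(memory: CacheMemory, context: List[Token],
         new_token: Token, window: int) -> List[Token]:
    """Append a token; move anything beyond `window` into memory."""
    context = context + [new_token]
    overflow = len(context) - window
    if overflow > 0:
        memory.write(context[:overflow])
        context = context[overflow:]
    return context


# Toy rollout: five one-dimensional tokens, a window of 2, a cache of 2.
memory = CacheMemory(capacity=2)
context: List[Token] = []
for t in range(5):
    context = step(memory, context, [float(t)], window=2)

# The backbone would see cached tokens followed by the live context.
model_input = prepend_injection(memory.read(), context)
```

In this toy rollout the oldest token falls out of both the window and the cache, which is the capacity trade-off in miniature: a cache extends the span by a fixed number of slots, whereas a compressive encoder (for example a state-space memory) would summarize evicted tokens instead of storing them verbatim.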

Paper Structure

This paper contains 22 sections, 2 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Visual comparison of twenty imagined steps in the MemoryMaze environment for the ViT baseline and the top encoder-injection pair for each encoder type. The top row (highlighted in green) is the ground truth, followed by the vanilla ViT baseline, cached memory prepended to the context window, cross-attention to SSM-encoded memories, and Titans-based neural memory with LoRA-based injections.