Evaluating Long-Term Memory in 3D Mazes

Jurgis Pasukonis, Timothy Lillicrap, Danijar Hafner

TL;DR

The paper introduces Memory Maze, a 3D, partially observable benchmark designed to isolate long-term memory in RL and sequence models. It provides online, offline, and probing evaluation protocols, plus a large offline dataset of 30M environment steps each for the 9x9 and 15x15 mazes. Empirical results show that humans outperform RL agents on the largest mazes, while training with truncated backpropagation through time and adding an auxiliary probe loss both improve memory performance, especially in smaller mazes. The work delivers open-source infrastructure and datasets to advance memory-centric research and suggests future directions for closing the gap to human memory capabilities.

Abstract

Intelligent agents need to remember salient information to reason in partially-observed environments. For example, agents with a first-person view should remember the positions of relevant objects even if they go out of view. Similarly, to effectively navigate through rooms agents need to remember the floor plan of how rooms are connected. However, most benchmark tasks in reinforcement learning do not test long-term memory in agents, slowing down progress in this important research direction. In this paper, we introduce the Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating long-term memory in agents. Unlike existing benchmarks, Memory Maze measures long-term memory separate from confounding agent abilities and requires the agent to localize itself by integrating information over time. With Memory Maze, we propose an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation. Recording a human player establishes a strong baseline and verifies the need to build up and retain memories, which is reflected in their gradually increasing rewards within each episode. We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on the large mazes, leaving room for future algorithmic designs to be evaluated on the Memory Maze.

Paper Structure (20 sections, 13 figures, 9 tables)

Figures (13)

  • Figure 1: The first 150 time steps of an episode in the Memory Maze 9x9 environment. The bottom row shows the top-down view of a randomly generated maze with 3 colored objects. The agent only observes the first-person view (top row) which includes a prompt for the next object to find as a border of the corresponding color. The agent receives +1 reward when it reaches the object of the prompted color. During the episode, the agent has to visit the same objects multiple times, testing its ability to memorize their positions, the way the rooms are connected, and its own location.
  • Figure 2: Examples of randomly generated Memory Maze layouts of the four sizes.
  • Figure 3: Online RL benchmark results after 100M environment steps of training. Error bars show the standard deviation over 5 runs. We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on the large mazes, leaving room for future algorithmic designs to be evaluated on the Memory Maze.
  • Figure 4: Comparison of Dreamer (TBTT), trained with the standard world model loss, against the Dreamer (TBTT + Probe loss) agent, which adds an auxiliary object-location probe prediction loss that encourages the model to remember relevant information. Dreamer (TBTT + Probe loss) shows a significant improvement, indicating that task performance is indeed bottlenecked by memory. Since this agent uses additional probe information at training time, it should not be considered a baseline for the online RL benchmark.
  • Figure 5: Offline probing results on Memory 9x9 and Memory 15x15 datasets. Left: Average accuracy of wall probing (higher is better), with the perfect score being $100\%$ and VAE indicating a no-memory baseline. Right: Average mean-squared error (MSE) of object probing (lower is better), with the perfect score being 0 and VAE indicating a no-memory baseline.
  • ...and 8 more figures
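
The two probing metrics named in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function names, the 0.5 decision threshold, the grid encoding of walls (1 = wall, 0 = free), and the choice to average the squared error over every coordinate are all assumptions made for illustration.

```python
from typing import Sequence


def wall_probe_accuracy(predicted: Sequence[Sequence[float]],
                        target: Sequence[Sequence[int]],
                        threshold: float = 0.5) -> float:
    """Percentage of grid cells whose wall / no-wall prediction is correct.

    `predicted` holds per-cell wall probabilities, `target` holds 0/1 labels.
    The threshold of 0.5 is an assumption, not a value taken from the paper.
    """
    correct = 0
    total = 0
    for pred_row, target_row in zip(predicted, target):
        for prob, label in zip(pred_row, target_row):
            correct += int((prob >= threshold) == bool(label))
            total += 1
    return 100.0 * correct / total


def object_probe_mse(predicted: Sequence[Sequence[float]],
                     target: Sequence[Sequence[float]]) -> float:
    """Mean squared error over all coordinates of all object positions.

    Averaging per coordinate (rather than summing per object) is an assumption.
    """
    squared_errors = [
        (p - t) ** 2
        for pred_xy, target_xy in zip(predicted, target)
        for p, t in zip(pred_xy, target_xy)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy data: a 2x2 wall grid and two object positions.
walls_pred = [[0.9, 0.2], [0.4, 0.7]]
walls_true = [[1, 0], [1, 1]]
objects_pred = [(1.0, 2.0), (3.0, 5.0)]
objects_true = [(1.0, 2.0), (3.0, 3.0)]

print(wall_probe_accuracy(walls_pred, walls_true))  # 75.0 (3 of 4 cells correct)
print(object_probe_mse(objects_pred, objects_true))  # 1.0 (one coordinate off by 2)
```

In this sketch a perfect wall probe scores 100 and a perfect object probe scores 0, matching the "higher is better" and "lower is better" directions stated in the caption.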