Mastering Memory Tasks with World Models

Mohammad Reza Samsami; Artem Zholus; Janarthanan Rajendran; Sarath Chandar

TL;DR

This work tackles long-term dependencies in model-based RL by introducing Recall to Imagine (R2I), which embeds a Structured State Space Model (S3M) into a DreamerV3-based world model to improve long-term memory and long-horizon credit assignment. The core idea is to replace the recurrent posterior with a non-recurrent one, so that the world model can be computed in parallel over a sequence during training, while SSMs provide the temporal model. Empirically, R2I sets a new state of the art in memory-intensive domains (BSuite, POPGym, Memory Maze), surpasses human performance in Memory Maze, and remains comparable to baselines on Atari and DMC, with up to 9x faster wall-time convergence than DreamerV3. The results suggest that SSM-based world models generalize across memory and non-memory domains while improving the memory capability of RL systems.

Abstract

Current model-based reinforcement learning (MBRL) agents struggle with long-term dependencies. This limits their ability to effectively solve tasks involving extended time gaps between actions and outcomes, or tasks demanding the recalling of distant observations to inform current actions. To improve temporal coherence, we integrate a new family of state space models (SSMs) in world models of MBRL agents to present a new method, Recall to Imagine (R2I). This integration aims to enhance both long-term memory and long-horizon credit assignment. Through a diverse set of illustrative tasks, we systematically demonstrate that R2I not only establishes a new state-of-the-art for challenging memory and credit assignment RL tasks, such as BSuite and POPGym, but also showcases superhuman performance in the complex memory domain of Memory Maze. At the same time, it upholds comparable performance in classic RL tasks, such as Atari and DMC, suggesting the generality of our method. We also show that R2I is faster than the state-of-the-art MBRL method, DreamerV3, resulting in faster wall-time convergence.
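
As a rough illustration of why an SSM-based world model can process a whole sequence in parallel yet still be rolled out one step at a time, the sketch below computes the same linear state update in two ways: a step-by-step loop and a prefix scan over affine maps. It is a toy under stated assumptions (a scalar state, fixed coefficients `a` and `b`, and hypothetical helper names `combine`, `ssm_recurrent`, `ssm_scan`), not the R2I implementation.

```python
from typing import List, Tuple

# (a, b) represents the affine map x -> a * x + b (illustrative type alias).
Elem = Tuple[float, float]


def combine(first: Elem, second: Elem) -> Elem:
    """Compose two affine maps: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def ssm_recurrent(a: float, b: float, inputs: List[float], x0: float = 0.0) -> List[float]:
    """Step-by-step mode: x_t = a * x_{t-1} + b * u_t."""
    states, x = [], x0
    for u in inputs:
        x = a * x + b * u
        states.append(x)
    return states


def ssm_scan(a: float, b: float, inputs: List[float], x0: float = 0.0) -> List[float]:
    """Scan mode: inclusive prefix composition of the per-step affine maps."""
    elems: List[Elem] = [(a, b * u) for u in inputs]
    offset = 1
    while offset < len(elems):
        # Each position in a pass depends only on the previous pass,
        # so on parallel hardware all positions could be updated at once.
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    # elems[t] now maps the initial state x0 directly to x_t.
    return [a_t * x0 + b_t for a_t, b_t in elems]


if __name__ == "__main__":
    u = [1.0, 0.0, 2.0, -1.0, 0.5]
    print(ssm_recurrent(0.9, 0.5, u))
    print(ssm_scan(0.9, 0.5, u))
```

Both functions return the same states up to floating-point rounding. The scan version needs only about log2(T) passes over a length-T sequence, which is the kind of structure that makes parallel sequence processing possible, while the loop version corresponds to generating one step at a time.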

Paper Structure (37 sections, 9 equations, 27 figures, 6 tables)

Introduction
Background
State Space Models
From Imagination To Action
Methodology
World Model Details
Actor-Critic Details
Experiments
Quantifying Memory of R2I
Evaluating Long-term Memory In Complex 3D Tasks
Assessing the Generality of R2I in Non-Memory Domains
Conclusion
Appendix
Related Work
Variations of SSMs and Our Design Choices
...and 22 more sections

Figures (27)

Figure 1: Graphical representation of R2I. (Left) The world model encodes past experiences, transforming observations and actions into compact latent states. Reconstructing the trajectories serves as a learning signal for shaping these latent states. (Right) The policy learns from trajectories based on latent states imagined by the world model. The representation corresponds to the full state policy, and we have omitted the critic for the sake of simplifying the illustration.
Figure 2: Computational time taken by DreamerV3 and R2I (lower is preferred)
Figure 3: Success rates of DreamerV3 (which holds the previous SOTA) and R2I in BSuite environments. A separate model is trained for every point on the x-axis. The median over 10 seeds is plotted, with the shaded region spanning the 25th to 75th percentiles. Training curves are in Appendix \ref{appendix:bsuite}.
Figure 4: R2I results in memory-intensive environments of POPGym. Our method establishes the new SOTA in the hardest memory environments: Autoencode (-Easy, -Medium), RepeatPrevious (-Medium, -Hard), and Concentration (-Medium). Note that Concentration is a task that can be partially solved without memory. For PPO+S4D, refer to Appendix \ref{app:s4d}.
Figure 5: Scores in Memory Maze after 400M environment steps. R2I outperforms baselines across difficulty levels, becoming the domain's new SOTA. Owing to its greater computational efficiency, R2I was trained for fewer days than Dreamer, as illustrated in Figure \ref{fig:mmaze_wall}.
...and 22 more figures
