ARROW: Augmented Replay for RObust World models

Abdulaziz Alyahya; Abdallah Al Siyabi; Markus R. Ernst; Luke Yang; Levin Kuhlmann; Gideon Kowadlo

Abstract

Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones, with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
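The two-buffer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `DualReplayBuffer`, its parameters, and the even split between buffers at sampling time are all hypothetical, and reservoir sampling is used here only as one simple, well-known way to keep a fixed-size store that approximates the distribution of everything seen so far. The paper's long-term buffer may work differently.

```python
import random
from collections import deque


class DualReplayBuffer:
    """Hypothetical two-part buffer: a recent FIFO plus a long-term store.

    Assumption: the long-term store uses reservoir sampling as a stand-in
    for distribution matching; the paper's actual rule is not shown here.
    """

    def __init__(self, short_capacity, long_capacity, seed=0):
        self.short = deque(maxlen=short_capacity)  # FIFO: oldest items drop out
        self.long = []                             # uniform sample of all items seen
        self.long_capacity = long_capacity
        self.seen = 0                              # total items offered so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.short.append(item)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(item)
        else:
            # Reservoir sampling (Algorithm R): the new item is kept with
            # probability long_capacity / seen, overwriting a random slot.
            j = self.rng.randrange(self.seen)
            if j < self.long_capacity:
                self.long[j] = item

    def sample(self, batch_size):
        """Draw about half the batch from each buffer (an assumed split)."""
        n_long = min(batch_size // 2, len(self.long))
        n_short = min(batch_size - n_long, len(self.short))
        batch = self.rng.sample(self.long, n_long)
        batch += self.rng.sample(list(self.short), n_short)
        return batch


# Usage: two consecutive tasks, 100 steps each.
buf = DualReplayBuffer(short_capacity=4, long_capacity=4)
for task in ("task_a", "task_b"):
    for step in range(100):
        buf.add((task, step))
```

After the loop, the short-term buffer holds only the last four `task_b` steps, while each long-term slot is equally likely to hold any of the 200 items, so earlier tasks remain represented in expectation without the buffer growing.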

Paper Structure (50 sections, 13 equations, 8 figures, 17 tables, 2 algorithms)

Introduction
The challenge of continual reinforcement learning
Related work and limitations
The neuroscience-inspired alternative
Our contribution
Background
Augmented Replay for RObust World models (ARROW)
World model
Actor-critic controller
Augmented replay buffer
Short-term FIFO buffer
Long-term global distribution matching (LTDM) buffer
Spliced rollouts
Task-agnostic exploration
Experiments
...and 35 more sections

Figures (8)

Figure 1: World Model Learning. (A) Images drawn from the replay buffer are encoded to and reconstructed from a latent space using a recurrent state space model. (B) Learning the policy is achieved with Actor ($\mathcal{A}$) and Critic ($\mathcal{C}$) networks applied to latent states "dreamt-up" by the model.
Figure 2: Experiment setup. (A) Augmented buffer used in ARROW. (B) Continual learning tasks with and without shared structure. NB: no background, RT: restricted themes, GA: generated assets, MA: monochrome assets, CA: centered agent.
Figure 3: Atari median normalized performance (Eq. \ref{eq:normalization}). Shaded area depicts 0.25 and 0.75 quartiles of 5 seeds. Bold line segments indicate training of task. (A) Default order of tasks (one-cycle). (B) Reversed order of tasks (one-cycle). (C) Default order of tasks (two-cycle). The dotted vertical line marks the end of cycle 1 and the beginning of cycle 2.
Figure 4: Atari metrics shown as median with (0.25 - 0.75) quartile confidence intervals, across 5 seeds, and calculated using normalized scores (Eq. \ref{eq:normalization}). (A) Default task order (one-cycle). (B) Reversed task order (one-cycle). (C) Default task order (two-cycle).
Figure 5: CoinRun median normalized performance (Eq. \ref{eq:normalization}). Shaded area depicts 0.25 and 0.75 quartiles of 5 seeds. Bold line segments indicate training of task. (A) Default order of tasks (one-cycle). (B) Reversed order of tasks (one-cycle). (C) Default order of tasks (two-cycle). The dotted vertical line marks the end of cycle 1 and the beginning of cycle 2.
...and 3 more figures
