Investigating Memory in RL with POPGym Arcade

Zekang Wang, Zhe He, Borong Zhang, Edan Toledo, Steven Morad

TL;DR

The paper addresses how to evaluate memory fairly in deep RL under partial observability. It introduces a memory-analysis toolkit and POPGym Arcade, a hardware-accelerated, Atari-like benchmark in which every environment comes as an MDP/POMDP twin sharing identical observation and action spaces. The formal tools (Observability Gap, Memory Bias, Recall Density, and pixelwise visualizations) separate the effect of memory from overall policy performance and help interpret how memory is used. Two findings stand out: value functions can smear credit across irrelevant history (Value Smearing), and OOD observations can contaminate recurrent states, perturbing decisions far into the future, with implications for offline RL and sim-to-real transfer. The benchmark supports controlled, high-throughput experiments and gives a basis for fair memory evaluation in RL.

Abstract

How should we analyze memory in deep RL? We introduce mathematical tools for fairly analyzing policies under partial observability and revealing how agents use memory to make decisions. To utilize these tools, we present POPGym Arcade, a collection of Atari-inspired, hardware-accelerated, pixel-based environments sharing a single observation and action space. Each environment provides fully and partially observable variants, enabling counterfactual studies on observability. We find that controlled studies are necessary for fair comparisons, and identify a pathology where value functions smear credit over irrelevant history. With this pathology, we demonstrate how out-of-distribution scenarios can contaminate memory, perturbing the policy far into the future, with implications for sim-to-real transfer and offline RL.

Paper Structure

This paper contains 108 sections, 15 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Observations from our environment twins. All environments share a unifying observation and action space, enabling counterfactual studies with observability as the sole independent variable.
  • Figure 2: We disentangle the return using our memory analysis tools. We plot the POMDP returns $\in [0, 1]$, the Observability Gap, and Memory Bias. We aggregate scores over all environments and difficulty configurations. Whiskers represent the 95% confidence interval over five seeds.
  • Figure 3: How do policies assign credit? We estimate recall density $\mathbb{E}_{\pi, f}[\delta_Q(\mathbf{x}, \tau)]$ for the start, middle, and end of a trajectory, aggregating across models and seeds. All density for MDPs should be in $0.66 \leq \tau < 1.0$. Instead, we see credit distributed diffusely over the whole trajectory, across models and tasks.
  • Figure 4: How do fully-trained agents use memory? We plot pixelwise memory gradients (\ref{eq:pixel_gradient}) for the LRU (top rows) and GRU (bottom rows). In these MDPs, $V_*(s_{t})$ is independent of $s_{t-k} \dots s_{t-1}$, yet memory incorrectly smears value credit over uninformative past states, even with a residual connection bypassing the memory model. Smeared value attribution suggests that value estimators may not generalize to new trajectories.
  • Figure 5: OOD scenarios contaminate the recurrent state and perturb the policy. We add noise $\epsilon$ into one selected observation and examine how this perturbation propagates into current values and actions. The rightmost columns plot the relative $Q$ values ($A$), and we color the policy action. We see that not only do the future values change significantly, but so does the policy action.
  • ...and 10 more figures
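
The perturbation experiment described for Figure 5 can be illustrated with a toy sketch. This is not the paper's code or model: it assumes a hypothetical scalar tanh recurrence with made-up weights (`w_h`, `w_x`) and random scalar "observations", and it tracks only the recurrent state, not values or actions. It shows the basic mechanism: noise added to a single observation leaves earlier recurrent states untouched but changes every state after it.

```python
import math
import random


def rollout(observations, w_h=0.9, w_x=0.5):
    """Run a scalar tanh recurrence (a stand-in for a memory model).

    Returns the recurrent state after each observation.
    The weights w_h and w_x are illustrative assumptions.
    """
    h = 0.0
    states = []
    for x in observations:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


def perturbation_trace(observations, index, epsilon):
    """Per-step |h_clean - h_perturbed| when epsilon is added to one observation."""
    perturbed = list(observations)
    perturbed[index] += epsilon
    clean_states = rollout(observations)
    noisy_states = rollout(perturbed)
    return [abs(a - b) for a, b in zip(clean_states, noisy_states)]


# Hypothetical scalar observations; a real agent would see pixel frames.
rng = random.Random(0)
observations = [rng.uniform(-1.0, 1.0) for _ in range(20)]

# Inject noise into the observation at step 5 only.
trace = perturbation_trace(observations, index=5, epsilon=1.0)
for t, gap in enumerate(trace):
    print(f"t={t:2d}  |delta h|={gap:.6f}")
```

In this toy recurrence the gap shrinks over time because the tanh update is contractive for the chosen weights. How long a perturbation persists in a trained LRU or GRU depends on the learned dynamics, which this sketch does not model.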

Theorems & Definitions (3)

  • Definition 4.1: Observability Gap
  • Definition 4.2
  • Definition 4.3: Recall Density