Recurrent Reinforcement Learning with Memoroids

Steven Morad; Chris Lu; Ryan Kortvelesy; Stephan Liwicki; Jakob Foerster; Amanda Prorok

TL;DR

This work revisits the traditional approach to batching in recurrent reinforcement learning and uses memoroids to propose a batching method that improves sample efficiency, increases the return, and simplifies the implementation of recurrent loss functions in reinforcement learning.

Abstract

Memory models such as Recurrent Neural Networks (RNNs) and Transformers address Partially Observable Markov Decision Processes (POMDPs) by mapping trajectories to latent Markov states. Neither model scales particularly well to long sequences, especially compared to an emerging class of memory models called Linear Recurrent Models. We discover that the recurrent update of these models resembles a monoid, leading us to reformulate existing models using a novel monoid-based framework that we call memoroids. We revisit the traditional approach to batching in recurrent reinforcement learning, highlighting theoretical and empirical deficiencies. We leverage memoroids to propose a batching method that improves sample efficiency, increases the return, and simplifies the implementation of recurrent loss functions in reinforcement learning.
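
To make the monoid connection concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes a scalar linear recurrence $h_t = a_t h_{t-1} + b_t$; the names `combine`, `IDENTITY`, `scan`, and `recurrence` are illustrative and do not come from the paper. Pairs $(a, b)$ under the operator below form a monoid, so the recurrence can be evaluated by a scan, and associativity is the property a parallel scan would rely on.

```python
# Each element (a, b) represents the affine map h -> a * h + b.
IDENTITY = (1.0, 0.0)

def combine(x, y):
    """Compose two affine maps: apply x first, then y."""
    a1, b1 = x
    a2, b2 = y
    return (a1 * a2, a2 * b1 + b2)

def scan(elements):
    """Sequential inclusive scan over the monoid (a parallel scan gives the same result)."""
    out, acc = [], IDENTITY
    for e in elements:
        acc = combine(acc, e)
        out.append(acc)
    return out

def recurrence(elements, h0=0.0):
    """Reference loop: h_t = a_t * h_{t-1} + b_t."""
    out, h = [], h0
    for a, b in elements:
        h = a * h + b
        out.append(h)
    return out

elements = [(0.5, 1.0), (0.5, 2.0), (0.5, 3.0)]
states = [b for _, b in scan(elements)]  # with h0 = 0, the b term is the state

# Discounted return G_t = r_t + gamma * G_{t+1}: the same monoid,
# scanned over the reversed reward sequence (an illustrative special case).
gamma, rewards = 0.5, [1.0, 2.0, 4.0]
returns = [b for _, b in scan([(gamma, r) for r in reversed(rewards)])][::-1]
```

Here `states` matches the output of `recurrence(elements)`, and `returns` evaluates to `[3.0, 4.0, 4.0]`. The discounted return is included only to show that a familiar reinforcement learning quantity fits the same template; the paper's own formulation may differ in detail.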

Paper Structure

This paper contains 45 sections, 3 theorems, 39 equations, 9 figures, 5 algorithms.

Introduction
Contributions
Preliminaries
Rollouts, Causality, and Episode Boundaries
Partial Observability
Background and Related Work
The Shortcomings of Segments
Alternatives to Segments
On the Efficiency of Sequence Models
Monoids
Approach
Reformulating Existing Sequence Models
Accelerated Discounted Returns
Inline Recurrent State Resets
Tape-Based Batching
...and 30 more sections

Key Result

Theorem 4.2

All monoids $(H, \bullet, e_I)$ can be transformed into a resettable monoid $(G, \circ, g_I)$, defined as [...]. For a single episode, the $A$ term output by the operator $\circ$ is equivalent to the output of $\bullet$. Over multiple contiguous episodes, $\circ$ prevents information flow across episode boundaries.
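
The behaviour the theorem describes can be illustrated with a short sketch. The construction below is an assumption, not the theorem's exact definition: it pairs each element $A$ with a boolean begin flag $b$ and discards the accumulated left operand whenever the right operand starts a new episode. The helper names `make_resettable`, `reset_op`, and `scan` are hypothetical.

```python
def make_resettable(op, identity):
    """Wrap a monoid (op, identity) so that a begin flag resets the state.

    Assumed construction: elements are pairs (A, b), where b is True on the
    first step of an episode.
    """
    def reset_op(x, y):
        a1, b1 = x
        a2, b2 = y
        # If y begins a new episode, replace what x accumulated with the identity.
        left = identity if b2 else a1
        return (op(left, a2), b1 or b2)
    return reset_op, (identity, False)

def scan(op, identity, elements):
    """Sequential inclusive scan under an arbitrary monoid operator."""
    out, acc = [], identity
    for e in elements:
        acc = op(acc, e)
        out.append(acc)
    return out

def add(x, y):
    """Example monoid: integer addition, which yields a running sum."""
    return x + y

reset_add, reset_identity = make_resettable(add, 0)

# Two contiguous episodes, [1, 2, 3] and [10, 20]; True marks an episode start.
tape = [(1, True), (2, False), (3, False), (10, True), (20, False)]
running = [a for a, _ in scan(reset_add, reset_identity, tape)]
```

With this input, `running` is `[1, 3, 6, 10, 30]`: within each episode the $A$ term equals the plain running sum, and the sum restarts at the episode boundary. The wrapped operator remains associative, which is what allows it to be used in a scan.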

Figures (9)

Figure 1: We visualize the Segment-Based Batching approach often used in prior literature. A worker collects a rollout of episodes, denoted by color. Each episode is split and zero-padded to produce a batch of segments, each with a constant, user-specified segment length $L$. Episodes exceeding the specified length are broken into multiple segments, preventing backpropagation through time from reaching earlier segments. Segments contain zero padding, reducing efficiency, biasing normalization methods, and necessitating padding-aware recurrent loss functions.
Figure 2: A visualization of sampling in TBB, with a batch size of $B = 4$. Transitions from rollouts are stored in-order in $\mathcal{D}$, with each color denoting a separate episode. Associated episode begin indices are stored in $\mathcal{I}$. We sample a train batch by randomly selecting from $\mathcal{I}$. For example, we might sample $4$ from $\mathcal{I}$, corresponding to $E_1$ in red. Next, we sample $7$ from $\mathcal{I}$, corresponding to $E_2$ in red. We concatenate $\mathcal{B} = \textrm{concat}(E_1, E_2)$ and return the result as a train batch.
Figure 3: We demonstrate that SBB can hurt Q-learning through truncated BPTT. We examine the Repeat Previous task, with a reward memory length of $\textrm{RML} = 10$, comparing SBB (left) to TBB (right). For SBB, we set $L = \textrm{RML} = 10$ to capture all necessary information. After training, we plot the cumulative partial derivative with respect to the observations on the y-axis. This partial derivative determines the value memory length (VML): how much each prior observation contributes to the Q value. We draw a vertical red line at $L = \textrm{RML} = 10$. We see that across models, a majority of the Q value is not learnable when using SBB. Even when we set $L = \infty$ using TBB, we see that the VML still spans far beyond the RML. This surprising finding shows that truncated BPTT degrades recurrent value estimators.
Figure 4: We compare TBB (ours) to SBB across POPGym tasks and memory models, reporting the mean and 95% bootstrapped confidence interval of the evaluation return over ten seeds. We find that TBB significantly improves sample efficiency. See \ref{fig:all_envs} for more experiments.
Figure 5: (Left) We compare how long it takes to compute the discounted return using our memoroid with the standard way of iterating through a batch. Computing the discounted return is orders of magnitude faster when using our memoroid implementation. (Right) We compare the total time to train a policy on Repeat First. For both experiments, we evaluate ten random seeds on an RTX 2080Ti GPU.
...and 4 more figures
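
The two batching schemes described in Figures 1 and 2 can be contrasted in a small sketch. This is an illustration under stated assumptions rather than the paper's algorithm: integers stand in for transitions, the function names are hypothetical, and the tape sampler keeps whole episodes without truncating the batch to exactly $B$ transitions, a detail the captions do not settle.

```python
import random

def segment_batch(episodes, seg_len, pad=0):
    """SBB sketch: split each episode into fixed-length, zero-padded segments."""
    segments, masks = [], []
    for ep in episodes:
        for start in range(0, len(ep), seg_len):
            chunk = ep[start:start + seg_len]
            n_pad = seg_len - len(chunk)
            segments.append(chunk + [pad] * n_pad)
            masks.append([1] * len(chunk) + [0] * n_pad)  # 0 marks padding
    return segments, masks

def build_tape(episodes):
    """TBB sketch: transitions stored in order in D, episode begin indices in I."""
    D, I = [], []
    for ep in episodes:
        I.append(len(D))
        D.extend(ep)
    return D, I

def sample_tape_batch(D, I, batch_size, rng):
    """Concatenate randomly chosen whole episodes until batch_size is reached.

    Assumption: the batch may exceed batch_size; it is not truncated.
    """
    ends = I[1:] + [len(D)]
    batch, begins = [], []
    while len(batch) < batch_size:
        k = rng.randrange(len(I))
        begins.append(len(batch))  # position in the batch where an episode starts
        batch.extend(D[I[k]:ends[k]])
    return batch, begins

episodes = [[1, 2, 3, 4], [5, 6, 7], [8, 9]]
segments, masks = segment_batch(episodes, seg_len=3)
D, I = build_tape(episodes)
batch, begins = sample_tape_batch(D, I, batch_size=4, rng=random.Random(0))
```

In this toy example the segment batch is `[[1, 2, 3], [4, 0, 0], [5, 6, 7], [8, 9, 0]]`: the first episode is split across two segments and three of the twelve slots are padding that a loss function would have to mask. The tape yields `I == [0, 4, 7]` and a batch with no padding, where the `begins` positions are the only episode-boundary information that needs to be carried along.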

Theorems & Definitions (9)

Definition 3.1
Definition 4.1
Theorem 4.2
proof
Theorem D.1
proof
Theorem E.1
proof
proof: Proof of Theorem \ref{thm:reset}
