Real-Time Recurrent Learning using Trace Units in Reinforcement Learning

Esraa Elelimy; Adam White; Michael Bowling; Martha White

TL;DR

Recurrent Trace Units (RTUs) are a small modification of LRUs that nonetheless yields significant performance benefits over LRUs when trained with RTRL. RTUs significantly outperform other recurrent architectures across several partially observable environments while using significantly less computation.

Abstract

Recurrent Neural Networks (RNNs) are used to learn representations in partially observable environments. For agents that learn online and continually interact with the environment, it is desirable to train RNNs with real-time recurrent learning (RTRL); unfortunately, RTRL is prohibitively expensive for standard RNNs. A promising direction is to use linear recurrent architectures (LRUs), where dense recurrent weights are replaced with a complex-valued diagonal, making RTRL efficient. In this work, we build on these insights to provide a lightweight but effective approach for training RNNs in online RL. We introduce Recurrent Trace Units (RTUs), a small modification on LRUs that we nonetheless find to have significant performance benefits over LRUs when trained with RTRL. We find RTUs significantly outperform other recurrent architectures across several partially observable environments while using significantly less computation.
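To illustrate why a diagonal recurrence makes RTRL cheap, here is a minimal sketch for a single complex-valued unit. It assumes a scalar input, the recurrence h_t = lam * h_{t-1} + w * x_t, and the polar form lam = r * exp(i * theta). The function names and this polar parameterization are illustrative assumptions, not the RTU parameterization from the paper. Each parameter needs only one running trace, updated in constant time per step, so no past activations are stored.

```python
import cmath


def run_with_traces(r, theta, w, xs):
    """Run one diagonal unit forward while carrying RTRL sensitivity traces.

    Hypothetical setup: h_t = lam * h_{t-1} + w * x_t, lam = r * exp(i*theta).
    Returns the final state and dh/dr, dh/dtheta, dh/dw at the final step.
    """
    phase = cmath.exp(1j * theta)
    lam = r * phase
    h = 0j
    dh_dr = 0j
    dh_dtheta = 0j
    dh_dw = 0j
    for x in xs:
        # Traces are updated before h because they depend on h_{t-1}.
        dh_dr = phase * h + lam * dh_dr            # d(lam)/dr = exp(i*theta)
        dh_dtheta = 1j * lam * h + lam * dh_dtheta  # d(lam)/dtheta = i*lam
        dh_dw = x + lam * dh_dw
        h = lam * h + w * x
    return h, dh_dr, dh_dtheta, dh_dw


def forward(r, theta, w, xs):
    """Plain forward pass, used only to check the traces numerically."""
    lam = r * cmath.exp(1j * theta)
    h = 0j
    for x in xs:
        h = lam * h + w * x
    return h


if __name__ == "__main__":
    xs = [1.0, -0.5, 2.0, 0.3]
    h, d_r, d_theta, d_w = run_with_traces(0.9, 0.4, 0.7, xs)
    eps = 1e-6
    fd_r = (forward(0.9 + eps, 0.4, 0.7, xs) - forward(0.9 - eps, 0.4, 0.7, xs)) / (2 * eps)
    print("trace dh/dr:", d_r, "finite difference:", fd_r)
```

In a layer of n such units the traces stay per-unit, so the memory and compute for the recurrent parameters grow linearly with n, in contrast to RTRL on a dense recurrent weight matrix.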

Paper Structure (40 sections, 3 theorems, 61 equations, 27 figures, 1 table, 1 algorithm)

Introduction
Background
Recurrent Trace Units
Revisiting Complex-valued Diagonal Recurrence
The RTU Parameterization
The RTRL Update for RTUs
Contrasting to LRUs
Online Prediction Learning
Ablation Study on Architectural Choices for RTUs and LRUs
Learning under resources constraints
Real-Time Recurrent Policy Gradient
Linear RTRL Methods in Incremental and Batch Settings
Experiments in Memory-Based Control
Conclusion and Limitations
Acknowledgments
...and 25 more sections

Key Result

Proposition B.1

Assume $f \circ \mathbf{P} = \mathbf{P} \circ f$ for any full rank, potentially complex-valued $\mathbf{P} \in \mathbb{C}^{n \times n}$ with unit-length column vectors. Then given any $\mathbf{W}_{h}$ and $\mathbf{W}_{x}$ for the linear RNN equation (eq:linear_rnn), there is a corresponding complex-valued diagonal where $\overline{\mathbf{h}}_{t} \in \mathbb{C}^{n}$ is a linear transformation of $\mathbf{h}_{t}$.
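The idea behind this kind of statement can be checked numerically in a small case. The sketch below is not the paper's proof: it assumes $f$ is the identity, a 2x2 real $\mathbf{W}_{h}$ with distinct eigenvalues and a nonzero top-right entry, and a scalar input, and all helper names are hypothetical. It diagonalizes $\mathbf{W}_{h} = \mathbf{P} \Lambda \mathbf{P}^{-1}$, runs the dense recurrence and the diagonal recurrence side by side, and maps the diagonal state back through $\mathbf{P}$.

```python
import cmath


def eig2x2(W):
    """Eigenvalues and unit-column eigenvector matrix P of a 2x2 matrix.

    Assumes W[0][1] != 0 and distinct eigenvalues (illustrative restriction).
    """
    (a, b), (c, d) = W
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    lams = [(tr + disc) / 2, (tr - disc) / 2]
    # Column k of P is the eigenvector (b, lam_k - a) for eigenvalue lam_k.
    P = [[b, b], [lams[0] - a, lams[1] - a]]
    for k in range(2):
        norm = (abs(P[0][k]) ** 2 + abs(P[1][k]) ** 2) ** 0.5
        P[0][k] /= norm
        P[1][k] /= norm
    return lams, P


def inv2x2(M):
    (p, q), (r, s) = M
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]


def simulate(W_h, W_x, xs):
    """Return the dense state h_T and P @ h_bar_T from the diagonal recurrence."""
    lams, P = eig2x2(W_h)
    W_x_bar = matvec(inv2x2(P), W_x)  # input weights expressed in the eigenbasis
    h = [0.0, 0.0]
    h_bar = [0j, 0j]
    for x in xs:
        Wh = matvec(W_h, h)
        h = [Wh[0] + W_x[0] * x, Wh[1] + W_x[1] * x]
        # Diagonal recurrence: each complex unit evolves independently.
        h_bar = [lams[k] * h_bar[k] + W_x_bar[k] * x for k in range(2)]
    return h, matvec(P, h_bar)


if __name__ == "__main__":
    W_h = [[0.5, -0.6], [0.6, 0.5]]
    h, h_from_diag = simulate(W_h, [1.0, 0.5], [1.0, -2.0, 0.5, 3.0])
    print("dense state:      ", h)
    print("P @ diagonal state:", h_from_diag)
```

The two printed states should agree up to floating-point error, with the diagonal state carrying a negligible imaginary part after the change of basis.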

Figures (27)

Figure 1: Ablation over different architectural choices for RTUs and LRUs. The RTU variants are blue, and the LRU variants are orange. In each subplot, we restrict both architectures in a particular way, reporting prediction error (MSRE) as a function of hidden state size. Across variations, RTUs are often better and, at worst, tie LRUs. Both architectures are trained with RTRL here.
Figure 2: Learning under resource constraints in Trace Conditioning. Each of the four subplots shows how each algorithm's performance varies as a function of resources. (a) LRU and GRU with T-BPTT are not competitive with RTUs even as $T$ is increased, when the number of hidden units in LRU and GRU is restricted so that all algorithms use about the same computation per step. (b) If we allow GRU and LRU's computation to increase (fixed network size) while increasing $T$, the performance gap remains. (c) Fixing $T$ to a value large enough to solve the task, we can increase the number of parameters, holding the computation equal for all methods. (d) If we do not require compute to be equal across methods as we scale parameters, then the LRU can eventually match the error of RTU, but GRU cannot. The black dashed line represents near-perfect prediction performance.
Figure 3: Contrasting runtime in incremental and batch settings. In the incremental setting, evaluated in the animal-learning prediction task, T-BPTT updates scale with truncation length, whereas linear RTRL is constant. With batch updates, evaluated in Ant-P with PPO, linear RTRL remains linear and T-BPTT is slightly more efficient.
Figure 4: Learning curves on the MuJoCo POMDP benchmark. Environments with -P mean that velocity components are occluded from the observations, while -V means that the positions and angles are occluded. All architectures have the same number of recurrent parameters ($\sim 24$k parameters). For each architecture, we show the performance of its best-tuned variant.
Figure 5: Reacher, $30$ runs with standard errors.
...and 22 more figures

Theorems & Definitions (7)

Proposition B.1
proof
Definition C.1
Lemma C.2
proof
Theorem C.3
proof
