Real-Time Recurrent Reinforcement Learning

Julian Lemmel; Radu Grosu

TL;DR

This work tackles learning in partially observable environments by proposing Real-Time Recurrent Reinforcement Learning (RTRRL), a biologically plausible RL framework that operates online without backpropagation through time (BPTT). RTRRL combines a Meta-RL RNN backbone with a TD($\lambda$) actor-critic loop and online gradient computation via RFLO or RTRL, applied to CTRNNs or LRUs, using random feedback alignment to avoid weight transport. The approach performs competitively on POMDP benchmarks, memory tasks, and physics simulations, and offers insights into basal ganglia-like reward pathways and potential energy-efficient neuromorphic implementations. RTRRL thus provides an online, neuroscience-inspired alternative to BPTT-based RL that remains effective in partially observable settings.

Abstract

We introduce a biologically plausible RL framework for solving tasks in partially observable Markov decision processes (POMDPs). The proposed algorithm combines three integral parts: (1) A Meta-RL architecture, resembling the mammalian basal ganglia; (2) A biologically plausible reinforcement learning algorithm, exploiting temporal difference learning and eligibility traces to train the policy and the value-function; (3) An online automatic differentiation algorithm for computing the gradients with respect to parameters of a shared recurrent network backbone. Our experimental results show that the method is capable of solving a diverse set of partially observable reinforcement learning tasks. The algorithm we call real-time recurrent reinforcement learning (RTRRL) serves as a model of learning in biological neural networks, mimicking reward pathways in the basal ganglia.
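As a rough illustration of part (2), the following minimal sketch shows one online TD($\lambda$) update with an accumulating eligibility trace. It makes simplifying assumptions that are not taken from the paper: the critic is linear in a feature vector `x`, only the value function is updated, and the function name `td_lambda_step` and the hyperparameters `alpha`, `gamma`, and `lam` are hypothetical. In RTRRL the traces would also have to cover the recurrent backbone's parameters, which this sketch leaves out.

```python
def td_lambda_step(w, z, x, x_next, reward, done,
                   alpha=0.1, gamma=0.99, lam=0.9):
    """One online TD(lambda) update for a linear critic v(x) = w . x.

    w: weights, z: eligibility trace, x / x_next: feature vectors
    (all plain lists of floats). Returns updated (w, z) and the TD error.
    """
    v = sum(wi * xi for wi, xi in zip(w, x))
    # Bootstrap from the next state unless the episode has ended.
    v_next = 0.0 if done else sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v  # TD error

    # Accumulating trace: decay the old trace, then add grad_w v(x) = x.
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
    # Move every weight along its trace, scaled by the TD error.
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z, delta


# Usage: two one-hot states, a reward of 1 on the first transition.
w, z = [0.0, 0.0], [0.0, 0.0]
w, z, delta = td_lambda_step(w, z, [1.0, 0.0], [0.0, 1.0], reward=1.0, done=False)
```

Because the update needs only the current transition and the running trace, nothing has to be stored or replayed, which is what makes this style of learning compatible with a fully online setting.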

Paper Structure (30 sections, 13 equations, 11 figures, 3 tables, 2 algorithms)

Introduction
Real-Time Recurrent RL
Continuous-Time RNN.
Linear Recurrent Units (LRUs).
The Meta-RL RNN Architecture.
Temporal-Difference Learning (TD).
Policy Gradient.
Eligibility Traces (ET).
Real-Time Recurrent Learning (RTRL).
Random Feedback Local Online Learning (RFLO).
Putting All Pieces Together
Experiments
Memory Length.
Ablation Experiments.
Physics Simulations.
...and 15 more sections

Figures (11)

Figure 1: RTRRL uses a Meta-RL RNN-backbone which receives observation $o_t$, previous action $a_{t-1}$ and reward $r_{t}$, computing the latent vector $h_t$ from which the action $a_t$ and the value estimate $\hat{v}_t$ are computed via linear functions.
Figure 2: Schematics showing how gradients are passed back to the RNN (yellow). Gradients of the actor (red) and critic (green) losses are propagated back towards $h_{t}$ and $h_{t-1}$ respectively.
Figure 3: Boxplot of the combined normalized validation rewards achieved for 5 runs each on a range of different environments from the gymnax and popgym packages. Depicted are results for RTRRL with CTRNNs and LRUs, each with and without FA. RTRRL-LRU-Meta and PPO-CTRNN-Meta perform best overall. Using FA always leads to diminished performance. Fully biologically plausible RTRRL-RFLO with FA often achieves on-par results.
Figure 4: Left: Boxplots of 10 runs on MemoryChain per type of plasticity for increasing memory lengths. BPTT refers to PPO with LSTM, RFLO and RTRL denote the variants of RTRRL and LocalMSE is a naive approximation to RTRL. Right: Tuning the entropy rate is a trade-off of best possible reward vs. consistency as shown for the MetaMaze environment.
Figure 5: Left: Shown are the mean rewards aggregated over 5 runs each; shaded regions are the variance. While in many cases it does not make a difference, not using the Meta-RL architecture hampers performance in some cases. Using biologically plausible Feedback Alignment can lead to worse results, but more often than not it does not have a significant impact.
...and 6 more figures
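
The data flow described in the Figure 1 caption can be sketched as a single forward step. This is an illustrative sketch under stated assumptions, not the authors' implementation: a plain tanh RNN cell stands in for the CTRNN or LRU backbone, actions are assumed discrete with a softmax over linear logits, and all names (`make_params`, `meta_rl_step`, `W_in`, `W_rec`, `W_pi`, `w_v`) are hypothetical.

```python
import math
import random


def make_params(obs_dim, n_actions, hidden, seed=0):
    """Small random weights; the input is [o_t, one_hot(a_{t-1}), r_t]."""
    rng = random.Random(seed)
    in_dim = obs_dim + n_actions + 1

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W_in": mat(hidden, in_dim), "W_rec": mat(hidden, hidden),
            "W_pi": mat(n_actions, hidden), "w_v": mat(1, hidden)[0]}


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def meta_rl_step(params, h_prev, obs, prev_action, reward):
    """Map (o_t, a_{t-1}, r_t, h_{t-1}) to (h_t, action probabilities, value)."""
    n_actions = len(params["W_pi"])
    one_hot = [1.0 if i == prev_action else 0.0 for i in range(n_actions)]
    x = list(obs) + one_hot + [reward]

    # Assumed stand-in recurrence: h_t = tanh(W_in x + W_rec h_{t-1}).
    pre = [a + b for a, b in zip(matvec(params["W_in"], x),
                                 matvec(params["W_rec"], h_prev))]
    h = [math.tanh(p) for p in pre]

    # Linear actor head, followed by a numerically stable softmax.
    logits = matvec(params["W_pi"], h)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Linear critic head: scalar value estimate.
    value = sum(w * hi for w, hi in zip(params["w_v"], h))
    return h, probs, value


# Usage: 3-dimensional observation, 2 actions, 4 hidden units.
params = make_params(obs_dim=3, n_actions=2, hidden=4)
h, probs, value = meta_rl_step(params, [0.0] * 4, [0.1, 0.2, 0.3],
                               prev_action=1, reward=1.0)
```

Feeding the previous action and reward back in as inputs is what lets the latent state $h_t$ carry task information across time steps, while both heads remain simple linear readouts of that state.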
