Real-Time Recurrent Reinforcement Learning

Julian Lemmel, Radu Grosu

TL;DR

This work tackles learning in partially observable environments by proposing Real-Time Recurrent Reinforcement Learning (RTRRL), a biologically plausible RL framework that operates online without backpropagation through time (BPTT). RTRRL combines a Meta-RL RNN backbone with a TD($\lambda$) actor-critic loop and online gradient computation via RTRL or RFLO (with CTRNN or LRU backbones), optionally using random feedback alignment (FA) to avoid weight transport. The approach performs competitively across POMDP benchmarks, memory tasks, and physics simulations, while offering insights into basal ganglia-like reward pathways and potential energy-efficient neuromorphic implementations. Overall, RTRRL is an online, neuroscience-inspired alternative to BPTT-based RL that remains effective in partially observable settings.

Abstract

We introduce a biologically plausible RL framework for solving tasks in partially observable Markov decision processes (POMDPs). The proposed algorithm combines three integral parts: (1) A Meta-RL architecture, resembling the mammalian basal ganglia; (2) A biologically plausible reinforcement learning algorithm, exploiting temporal difference learning and eligibility traces to train the policy and the value-function; (3) An online automatic differentiation algorithm for computing the gradients with respect to parameters of a shared recurrent network backbone. Our experimental results show that the method is capable of solving a diverse set of partially observable reinforcement learning tasks. The algorithm we call real-time recurrent reinforcement learning (RTRRL) serves as a model of learning in biological neural networks, mimicking reward pathways in the basal ganglia.
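
Part (2) of the abstract, temporal difference learning with eligibility traces, can be illustrated with a minimal sketch. The code below is not the authors' implementation: it shows only a textbook TD($\lambda$) update for a linear critic on a hypothetical two-state chain. The names `td_lambda_step` and `run_chain`, the hyperparameters, and the toy environment are all assumptions made for illustration. In an actor-critic setting, the same TD error would typically also drive the policy update through a second trace; that part is omitted here.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def td_lambda_step(w, z, x, x_next, reward, done, gamma=0.9, lam=0.8, lr=0.1):
    """One online TD(lambda) update for a linear value function v(x) = w . x.

    w: weights, z: eligibility trace, x / x_next: feature vectors.
    All hyperparameter defaults are illustrative assumptions.
    """
    v = dot(w, x)
    v_next = 0.0 if done else dot(w, x_next)
    delta = reward + gamma * v_next - v  # TD error
    # Accumulating trace: decay the old trace, add the gradient of v(x),
    # which is simply x for a linear critic.
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
    # Every weight moves in proportion to its eligibility.
    w = [wi + lr * delta * zi for wi, zi in zip(w, z)]
    return w, z, delta


def run_chain(episodes=500):
    """Hypothetical toy task: state 0 -> state 1 -> terminal, reward 1 at the end."""
    features = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # one-hot state features
    w = [0.0, 0.0]
    for _ in range(episodes):
        z = [0.0, 0.0]  # traces are reset at the start of each episode
        w, z, _ = td_lambda_step(w, z, features[0], features[1], 0.0, False)
        w, z, _ = td_lambda_step(w, z, features[1], [0.0, 0.0], 1.0, True)
    return w


if __name__ == "__main__":
    print(run_chain())
```

Each update uses only quantities available at the current time step, which is what lets this kind of learning rule run online without storing a trajectory.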

Paper Structure (30 sections, 13 equations, 11 figures, 3 tables, 2 algorithms)

Figures (11)

  • Figure 1: RTRRL uses a Meta-RL RNN-backbone which receives observation $o_t$, previous action $a_{t-1}$ and reward $r_{t}$, computing the latent vector $h_t$ from which the action $a_t$ and the value estimate $\hat{v}_t$ are computed via linear functions.
  • Figure 2: Schematics showing how gradients are passed back to the RNN (yellow). Gradients of the actor (red) and critic (green) losses are propagated back towards $h_{t}$ and $h_{t-1}$ respectively.
  • Figure 3: Boxplot of the combined normalized validation rewards achieved for 5 runs each on a range of different environments from the gymnax and popgym packages. Depicted are results for RTRRL with CTRNNs and LRUs, each with and without FA. RTRRL-LRU-Meta and PPO-CTRNN-Meta perform best overall. Using FA always leads to diminished performance. Fully biologically plausible RTRRL-RFLO with FA often achieves on-par results.
  • Figure 4: Left: Boxplots of 10 runs on MemoryChain per type of plasticity for increasing memory lengths. BPTT refers to PPO with LSTM, RFLO and RTRL denote the variants of RTRRL and LocalMSE is a naive approximation to RTRL. Right: Tuning the entropy rate is a trade-off of best possible reward vs. consistency as shown for the MetaMaze environment.
  • Figure 5: Left: Mean rewards aggregated over 5 runs each; shaded regions show the variance. While it often makes no difference, omitting the Meta-RL architecture hampers performance in some environments. Using biologically plausible Feedback Alignment can lead to worse results, but more often than not it does not have a significant impact.
  • ...and 6 more figures
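
The data flow described in the Figure 1 caption can be sketched as follows: the recurrent backbone receives the observation $o_t$, the previous action $a_{t-1}$ and the reward $r_t$, produces the latent vector $h_t$, and two linear heads read out the action distribution and the value estimate $\hat{v}_t$. This is a rough illustration under stated assumptions, not the paper's code: a plain tanh RNN cell stands in for the CTRNN and LRU backbones, actions are assumed to be discrete and one-hot encoded, and the names `init_params`, `meta_rl_step` and all parameter keys are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(obs_dim, n_actions, hidden, seed=0):
    """Small random weights; the initialization scale is an arbitrary assumption."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    in_dim = obs_dim + n_actions + 1  # o_t, one-hot a_{t-1}, scalar r_t
    return {
        "W_in": mat(hidden, in_dim),
        "W_rec": mat(hidden, hidden),
        "b": [0.0] * hidden,
        "W_actor": mat(n_actions, hidden),  # linear policy head
        "w_critic": mat(1, hidden)[0],      # linear value head
    }


def meta_rl_step(params, h_prev, obs, prev_action, reward):
    """One forward step: (o_t, a_{t-1}, r_t, h_{t-1}) -> (h_t, pi_t, v_hat_t)."""
    n_actions = len(params["W_actor"])
    one_hot = [1.0 if i == prev_action else 0.0 for i in range(n_actions)]
    x = list(obs) + one_hot + [float(reward)]  # Meta-RL input concatenation
    pre = [
        a + r + b
        for a, r, b in zip(matvec(params["W_in"], x),
                           matvec(params["W_rec"], h_prev),
                           params["b"])
    ]
    h = [math.tanh(p) for p in pre]                 # stand-in for a CTRNN/LRU update
    probs = softmax(matvec(params["W_actor"], h))   # action distribution
    value = sum(w * hi for w, hi in zip(params["w_critic"], h))  # value estimate
    return h, probs, value


if __name__ == "__main__":
    params = init_params(obs_dim=3, n_actions=2, hidden=4)
    h, probs, value = meta_rl_step(params, [0.0] * 4, [0.1, 0.2, 0.3], 1, 0.5)
    print(probs, value)
```

Feeding the previous action and reward back into the network alongside the observation is what makes this a Meta-RL architecture in the sense of the caption: the hidden state can integrate reward history even when the observation alone is ambiguous.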