An introduction to reinforcement learning for neuroscience

Kristopher T. Jensen

TL;DR

This review surveys reinforcement learning (RL) as a framework for understanding learning and decision making in neuroscience. It proceeds from classical temporal-difference learning and Q-learning, through model-free (MF), model-based (MB), and intermediate algorithms, to deep RL, distributional RL, and meta-reinforcement learning. It highlights neural correlates such as dopamine reward-prediction error signals, hippocampal predictive maps, and prefrontal dynamics, illustrating how computational ideas map onto brain circuits. Key contributions include clarifying the successor representation as a bridge between MB and MF learning, summarizing how distributional RL aligns with biological signals, and connecting meta-learning concepts to cortical circuitry. The discussion identifies open questions, particularly for ethologically relevant tasks and generalist brain-inspired models, and argues for integrating multiple learning strategies with data-constrained, hierarchical architectures to capture the richness of biological learning and decision making.

Abstract

Reinforcement learning (RL) has a rich history in neuroscience, from early work on dopamine as a reward prediction error signal (Schultz et al., 1997) to recent work proposing that the brain could implement a form of 'distributional reinforcement learning' popularized in machine learning (Dabney et al., 2020). There has been a close link between theoretical advances in reinforcement learning and neuroscience experiments throughout this literature, and the theories describing the experimental data have therefore become increasingly complex. Here, we provide an introduction and mathematical background to many of the methods that have been used in systems neuroscience. We start with an overview of the RL problem and classical temporal difference algorithms, followed by a discussion of 'model-free', 'model-based', and intermediate RL algorithms. We then introduce deep reinforcement learning and discuss how this framework has led to new insights in neuroscience. This includes a particular focus on meta-reinforcement learning (Wang et al., 2018) and distributional RL (Dabney et al., 2020). Finally, we discuss potential shortcomings of the RL formalism for neuroscience and highlight open questions in the field. Code that implements the methods discussed and generates the figures is also provided.

Paper Structure (17 sections, 32 equations, 6 figures)

This paper contains 17 sections, 32 equations, 6 figures.

Introduction
Problem setting
Temporal difference learning
Q-learning
Model-free and model-based reinforcement learning
The successor representation
Deep reinforcement learning
Distributional reinforcement learning
Learning from scalar rewards?
Discussion
Additional topics of interest
Hierarchical reinforcement learning
Off-policy & offline reinforcement learning
Imitation learning
Linear reinforcement learning
...and 2 more sections

Figures (6)

Figure 1: The reinforcement learning problem and cliffworld environment. (A) An agent (here the bird) interacts with the world to maximize reward. This involves a balance between exploring potentially interesting new states (e.g. searching for food in a new field) and exploiting states known to yield high reward (e.g. the field that had many worms yesterday). At a given point in time, the bird is in some state $s_t$ from which it can take an action $a_t$, with the probability of different actions determined by the 'policy' $\pi(a|s_t)$, which is controlled by the agent. $a_t$ then leads to a change in the environment according to the non-controllable environment dynamics $s_{t+1}, r_t \sim p(s, r | s_t, a_t)$. Here, $r_t$ is the empirical 'reward' received by the agent, and its objective is to collect as much cumulative reward as possible. Often, reinforcement learning problems are divided into 'episodes', with the agent learning over the course of multiple repeated exposures to the environment. This could for example consist of the bird learning over multiple days which fields are likely to be rich in food, while minimizing the distance travelled and exposure to predators. (B) The 'cliffworld' environment, which will be used to demonstrate the performance and behaviour of a range of reinforcement learning algorithms in this work. The agent starts in the lower left corner (location [0, 0]), and the episode finishes when it encounters either the 'cliff' (dark blue) or the goal (yellow; location [9, 0]). If the agent walks off the cliff, it receives a reward of -100. If it finds the goal, it receives a reward of +50. In any other state, it receives a reward of -1. Such negative rewards for 'neutral' actions are commonly used to encourage the agent to achieve its goal as fast as possible. The arrows indicate the 'optimal' policy, which takes the agent to the goal via the shortest possible route that avoids the cliff.
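The cliffworld rules in panel (B) can be sketched in a few lines of Python. This is a minimal illustration and not the code released with the paper. The class name `CliffWorld`, the grid height of 5, the placement of the cliff along the bottom row between start and goal, and the treatment of walls are all assumptions; only the start, goal, and reward values come from the caption.

```python
from dataclasses import dataclass

# Action name -> (dx, dy) displacement on the grid.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


@dataclass(frozen=True)
class CliffWorld:
    width: int = 10          # goal at x = 9 implies at least 10 columns
    height: int = 5          # assumption: not stated in the caption
    start: tuple = (0, 0)
    goal: tuple = (9, 0)

    def is_cliff(self, state):
        # Assumption: the cliff fills the bottom row between start and goal.
        x, y = state
        return y == 0 and self.start[0] < x < self.goal[0]

    def step(self, state, action):
        """Return (next_state, reward, done) for one deterministic move."""
        dx, dy = ACTIONS[action]
        # Assumption: moving into a wall leaves that coordinate unchanged.
        x = min(max(state[0] + dx, 0), self.width - 1)
        y = min(max(state[1] + dy, 0), self.height - 1)
        next_state = (x, y)
        if self.is_cliff(next_state):
            return next_state, -100.0, True
        if next_state == self.goal:
            return next_state, 50.0, True
        return next_state, -1.0, False


def rollout(env, actions):
    """Follow a fixed action sequence from the start; return final state and total reward."""
    state, total = env.start, 0.0
    for action in actions:
        state, reward, done = env.step(state, action)
        total += reward
        if done:
            break
    return state, total


env = CliffWorld()
shortest_safe_path = ["up"] + ["right"] * 9 + ["down"]
print(rollout(env, shortest_safe_path))  # ((9, 0), 40.0)
```

Under these assumptions the shortest safe route takes 11 steps: ten steps at -1 each, followed by the +50 goal, for a return of 40.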
Figure 2: Temporal difference learning. (A) Value functions acquired through temporal difference learning (\ref{eq:TD-learning}) while acting according to either a random (top) or an optimal (bottom) policy. These simulations were performed with a random start state in the cliffworld environment to ensure full coverage of the space. Dark blue indicates negative expected reward (-100) and yellow indicates positive expected reward (+50). These simulations used a learning rate of $\alpha = 0.05$ and no temporal discounting ($\gamma = 1$). Under the random policy, states near the cliff have low value even if they are close to the goal, since the agent often falls off the cliff from there. Under the optimal policy, all states have high expected reward, since the agent always reaches the goal. States nearer the goal have slightly higher value than those further away. (B) Empirical reward as a function of episode number for a TD-learning agent that acts according to \ref{eq:value_action_selection} while updating its value estimates according to \ref{eq:TD-learning}. For this agent, action selection assumes access to a 'one-step' world model in order to evaluate the consequence of each putative action. The agent gradually converges to an optimal policy. Parameters for the agent are as in (A), except that the start state is always the lower left corner. (C) TD error (\ref{eq:TD-learning}) as a function of the step number along the optimal path for the agent in (B) at different stages of learning (green to blue). This TD signal gradually propagates backwards from the reward to preceding states, mirroring biological recordings of dopamine activity (Schultz et al., 1997). (D) Value function learned by a greedy TD agent as in (B), plotted either early (top) or late (bottom) in training. Early in training, the agent has learned that the cliff is bad but doesn't know where the goal is or how to get there.
Late in training, the agent has learned a value function that locally resembles the optimal value function from (A), while it has not learned the value of distant states that are rarely or never visited from the start state. This is a potential shortcoming of 'greedy' agents that can easily converge to a sub-optimal local maximum in more complicated environments. For this analysis, we used a high learning rate of $\alpha = 0.5$ to make the early TD updates larger and therefore more visible.
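The backward propagation of the TD error in panel (C) can be illustrated with tabular TD(0) on a fixed path. The sketch below is a simplified stand-in for the paper's simulations: it uses the standard update $V(s_t) \leftarrow V(s_t) + \alpha [r_t + \gamma V(s_{t+1}) - V(s_t)]$ on a hypothetical deterministic 11-step path (ten steps with reward -1, then the +50 goal) with $\alpha = 0.05$ and $\gamma = 1$ as in the caption. The function name `td0_along_path` and the number of episodes are assumptions.

```python
import numpy as np


def td0_along_path(rewards, n_episodes=2000, alpha=0.05, gamma=1.0):
    """Tabular TD(0) on a fixed, deterministic sequence of states."""
    n_steps = len(rewards)
    values = np.zeros(n_steps + 1)       # last entry: terminal state, value 0
    td_errors = np.zeros((n_episodes, n_steps))
    for episode in range(n_episodes):
        for t in range(n_steps):
            # TD error: reward plus discounted next value minus current value.
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            values[t] += alpha * delta
            td_errors[episode, t] = delta
    return values[:n_steps], td_errors


# Hypothetical 11-step path: ten -1 steps, then the +50 goal.
rewards = [-1.0] * 10 + [50.0]
values, td_errors = td0_along_path(rewards)
print(np.round(values, 2))        # approaches 40, 41, ..., 50
print(np.round(td_errors[0], 2))  # first episode: -1 at every step, +50 at the goal
```

In the first episode the only large TD error occurs at the rewarded step. Over later episodes the error moves to earlier steps as the downstream values are learned, and it vanishes once the values match the return from each state.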
Figure 3: Q-learning. (A) Empirical reward as a function of episode number for Q-learners with different levels of stochasticity in their policy ($\epsilon \in \{0, 0.1, 0.2\}$; legend). For these simulations, we used a learning rate of $\alpha = 0.05$ for all agents and no temporal discounting ($\gamma = 1$). The agent with $\epsilon = 0$ converges to an optimal policy, similar to the TD agent in Figure 2A. However, convergence is in this case slower despite using the same learning rate, because the Q-learner has to learn about each action independently, while the TD agent used its one-step world model to aggregate learning across actions reaching the same state. In this cliffworld environment, increasing $\epsilon$ leads to worse performance since it increases the probability of falling off the cliff. Additionally, there is no risk of getting stuck in a local minimum since there is only one rewarding state, which decreases the value of exploration. Lines and shading indicate mean and standard error across 10 simulations. (B) As in (A), now for a non-cliffworld grid environment with two goals: one with a reward of +20 at location (0, 4), and one with a reward of +50 at location (5, 0). In this case, having non-zero $\epsilon$ can increase the probability of discovering the 'high reward' goal rather than getting stuck with a locally optimal policy of moving to the 'low reward' goal. In these simulations, we used a learning rate of $\alpha = 1$, since this effect is less robust with lower learning rates that lead to more exploration of the environment across all agents. (C) Cliffworld policy learned by a Q-learning (top) or SARSA (bottom) agent with $\epsilon = 0.3$. Colours indicate the maximum value of any action in a state from blue (-100) to yellow (+50), and arrows indicate which action has the highest value.
The Q-learning agent learns to move right above the cliff, because this is the optimal thing to do under the assumption that subsequent actions are also optimal. This is because it is an 'off-policy' algorithm that does not take into account the actual policy of the agent. In contrast, the SARSA agent learns to move a 'safe distance' away from the cliff, since it is an 'on-policy' algorithm that takes into account the finite probability of the agent choosing to move off the cliff from upcoming states. Q-learning agents are also frequently trained using a stochastic $\epsilon$-greedy policy and then evaluated with the greedy policy corresponding to $\epsilon = 0$, or they can be trained while 'annealing' $\epsilon$ from some finite value to $0$ over several episodes to allow for initial exploration.
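The off-policy versus on-policy distinction in panel (C) comes down to the bootstrap target. The sketch below uses the standard textbook forms of the two updates together with an $\epsilon$-greedy action rule; it is not taken from the paper's code. The function names and the two-state example values are hypothetical.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.05, gamma=1.0):
    # Off-policy target: assumes the best action is taken in the next state.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.05, gamma=1.0):
    # On-policy target: uses the action the agent actually takes next.
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])


# Hypothetical example: state 1 lies next to the cliff. Action 0 is safe
# (value 10), while action 1 steps off the cliff (value -100).
Q = np.zeros((2, 2))
Q[1] = [10.0, -100.0]

Q_off, Q_on = Q.copy(), Q.copy()
q_learning_update(Q_off, s=0, a=0, r=-1.0, s_next=1, done=False, alpha=0.5)
sarsa_update(Q_on, s=0, a=0, r=-1.0, s_next=1, a_next=1, done=False, alpha=0.5)
print(Q_off[0, 0], Q_on[0, 0])  # 4.5 -50.5
```

When the exploratory next action happens to step off the cliff, the SARSA estimate for moving towards the cliff edge drops sharply, while the Q-learning estimate ignores the exploratory action. This is the mechanism behind the 'safe distance' that the SARSA agent keeps from the cliff.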
Figure 4: Model-based reinforcement learning. (A) Learning curves for model-free (MF) and model-based (MB) RL agents. The MB agent used depth-first search to compute an optimal path at each decision point, gradually learning the reward and transition functions while exploring the environment. The MF agent was a Q-learning agent with $\epsilon = 0$ and learning rate $\alpha = 1$. (B) Wall-clock time needed to run 100 episodes of cliffworld with either the MB or MF agents from (A), as a function of the length of the environment. While the MB agent required less experience to learn a good policy, the wall-clock time per episode was much larger than for the MF agent. This illustrates an important balance between model-based and model-free reinforcement learning, where MF methods usually require more experience but MB methods require more compute at decision time. (C) Learning curve for an agent using the successor representation (SR) together with learning curves for the model-based agent in (A) and the greedy TD-agent from Figure 2. The goal was moved from location (9, 0) to location (0, 4) at episode 40 (vertical black line), and location (9, 0) was instead given a reward of -5. The MB and SR agents had their reward functions updated to reflect this change and rapidly adapted their policies, while the TD agent had no such mechanism for robustness to changing reward functions. Reward curves were convolved with a Gaussian kernel ($\sigma = 3$ episodes), which is why performance appears to decrease slightly before episode 40. The TD and SR agents were assumed to have access to a 1-step world model at initialization, while the MB agent learned the transition structure from experience. (D) SR agents cannot always adapt to new reward functions if the newly rewarded states have low probability under the old policy.
Left column: Value function for an agent that learned an initial policy in an environment with a small reward in the upper left corner and intermediate reward in the upper right corner. The middle top and bottom states are 'cliffs'. The agent learned to make an initial rightward choice (grey arrows). Right column: A large reward was introduced in either the top left corner (top row) or bottom left corner (bottom row) and the value function recomputed (\ref{eq:SR_recompute}). The agent was unable to adapt to a large reward in the bottom left corner, since the old policy had low probability of reaching this state, even after initially going to the left. This results in a low expected value for going left from the start state (red circle), and a suboptimal policy that continues to go right (red arrow). (E) Learning curve for a standard Q-learning agent (blue) or Dyna agents that perform different numbers of Q-value updates after each physical action (legend). These Dyna updates used cached experience rather than data from a learned world model. Dyna agents make better use of limited experience at the cost of increased compute (proportional to the number of updates).
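The policy dependence described in panel (D) can be reproduced in a toy example. The sketch below assumes the standard definition of the successor representation, $M = (I - \gamma T^\pi)^{-1}$, with values recomputed as $V = M r$ for a new reward vector $r$. The six-state layout, the transition probabilities, and the reward magnitudes are hypothetical and much simpler than the environment in the figure.

```python
import numpy as np


def successor_representation(T, gamma=1.0):
    """Closed-form SR for a policy-dependent transition matrix T."""
    n_states = T.shape[0]
    return np.linalg.inv(np.eye(n_states) - gamma * T)


# Hypothetical six-state layout; terminal states have all-zero rows in T.
START, LEFT, RIGHT, TOP_LEFT, BOTTOM_LEFT, TOP_RIGHT = range(6)
T = np.zeros((6, 6))
T[START, RIGHT] = 1.0          # old policy: always go right first
T[LEFT, TOP_LEFT] = 0.95       # old policy after going left: mostly up
T[LEFT, BOTTOM_LEFT] = 0.05
T[RIGHT, TOP_RIGHT] = 1.0
M = successor_representation(T)


def recompute_values(reward):
    # Values under the OLD policy's state occupancy, with new rewards.
    return M @ reward


# Case 1: a large reward appears where the old policy already goes.
r_top = np.zeros(6)
r_top[TOP_LEFT], r_top[TOP_RIGHT] = 20.0, 5.0
# Case 2: a large reward appears where the old policy rarely goes.
r_bottom = np.zeros(6)
r_bottom[TOP_LEFT], r_bottom[BOTTOM_LEFT], r_bottom[TOP_RIGHT] = 1.0, 20.0, 5.0

for name, r in [("top left", r_top), ("bottom left", r_bottom)]:
    V = recompute_values(r)
    print(name, "V(left) =", round(V[LEFT], 2), "V(right) =", round(V[RIGHT], 2))
```

In the first case the recomputed value of going left (19) exceeds that of going right (5), so a one-step lookahead switches direction. In the second case the value of going left is only 1.95, because $M$ still encodes the old policy's low probability of reaching the bottom left state, and the agent keeps going right.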
Figure 5: Meta-reinforcement learning. The results in this figure reproduce some of the analyses in Figure 1 of Wang et al. (2018). (A) We trained a recurrent meta-reinforcement learning agent in a two-armed bandit task, where the reward probabilities of each arm were sampled independently from $\mathcal{U}(0, 1)$ at the beginning of each episode and remained fixed throughout the episode. A recurrent neural network was trained across many episodes with different reward probabilities using an actor-critic algorithm. The input to the agent consisted of the previous action, the previous reward, and the time-within-trial. The average reward per episode is plotted against the episode number, showing that the agent gradually learns to adapt within each episode to the particular instantiation of the bandit task. Importantly, the parameters of the network are fixed within an episode, meaning that this adaptation occurs through the recurrent dynamics. Dashed horizontal lines indicate the reward of an agent selecting random actions and an 'oracle' agent that always chooses the best arm. (B) Heatmap showing example behaviour of the agent in episodes with different reward probabilities for the first arm, $p(r | a = 1)$. For the analysis here and in (C), we set $p(r | a = 2) = 1 - p(r | a = 1)$. Across episodes, the agent experiments with different actions and eventually converges on the optimal action. For episodes with more similar reward probabilities (near the middle), it takes longer to identify the optimal action. This balance between exploration and exploitation is mediated by the recurrent network dynamics, which are learned over many episodes using deep reinforcement learning. (C) We averaged the hidden state of the RNN over 100 episodes for each of several different reward probabilities, ranging from low (green) to high (blue) $p(r | a = 1)$.
We then performed PCA on the resulting matrix of average hidden states to compute a low-dimensional trajectory over the course of an episode for each reward probability. This two-dimensional embedding of neural activity converges to different regions of state space during the episode for different reward probabilities. Black cross indicates the hidden state at the beginning of an episode, and coloured points indicate the final hidden state in an episode for the different reward probabilities.
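The bandit task and the two reference lines in panel (A) can be sketched without the recurrent agent itself, which is not implemented here. The function names, the one-hot encoding of the previous action, and the normalisation of the time index are assumptions made for illustration; the sampling of arm probabilities from $\mathcal{U}(0, 1)$ follows the caption.

```python
import numpy as np


def sample_bandit(rng):
    """Draw the two arm probabilities for one episode from U(0, 1)."""
    return rng.uniform(0.0, 1.0, size=2)


def pull(arm_probs, action, rng):
    """Return a Bernoulli reward (0.0 or 1.0) for the chosen arm."""
    return float(rng.random() < arm_probs[action])


def make_input(prev_action, prev_reward, t, n_trials):
    """Assumed input encoding: one-hot action, reward, normalised time."""
    one_hot = np.zeros(2)
    one_hot[prev_action] = 1.0
    return np.concatenate([one_hot, [prev_reward, t / n_trials]])


def baseline_rewards(n_episodes, rng):
    """Expected reward per trial for a random agent and for an oracle."""
    arm_probs = rng.uniform(0.0, 1.0, size=(n_episodes, 2))
    random_agent = arm_probs.mean()           # picks each arm half the time
    oracle = arm_probs.max(axis=1).mean()     # always picks the better arm
    return random_agent, oracle


rng = np.random.default_rng(0)
random_agent, oracle = baseline_rewards(100_000, rng)
print(round(random_agent, 2), round(oracle, 2))
print(make_input(prev_action=1, prev_reward=1.0, t=5, n_trials=100))
```

With independent uniform arm probabilities, the random baseline is 1/2 reward per trial in expectation and the oracle baseline is $E[\max(p_1, p_2)] = 2/3$, so a trained agent should fall between these two values.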
...and 1 more figures
