Lifelong Reinforcement Learning via Neuromodulation

Sebastian Lee; Samuel Liebana; Claudia Clopath; Will Dabney

TL;DR

This work introduces an abstract framework for integrating theories and evidence from neuroscience and the cognitive sciences into the design of adaptive artificial reinforcement learning algorithms, and empirically validates the effectiveness of the resulting adaptive algorithm in a non-stationary multi-armed bandit problem.

Abstract

Navigating multiple tasks (for instance in succession, as in continual or lifelong learning, or in distributions, as in meta or multi-task learning) requires some notion of adaptation. Evolution over timescales of millennia has imbued humans and other animals with highly effective adaptive learning and decision-making strategies. Central to these functions are so-called neuromodulatory systems. In this work we introduce an abstract framework for integrating theories and evidence from neuroscience and the cognitive sciences into the design of adaptive artificial reinforcement learning algorithms. We give a concrete instance of this framework built on literature surrounding the neuromodulators Acetylcholine (ACh) and Noradrenaline (NA), and empirically validate the effectiveness of the resulting adaptive algorithm in a non-stationary multi-armed bandit problem. We conclude with a theory-based experiment proposal providing an avenue to link our framework back to efforts in experimental neuroscience.

Paper Structure (23 sections, 12 equations, 7 figures)

Introduction
Background
Neuromodulation for Adaptive RL
What Do Neuromodulatory Systems Do? (component I)
What Do Neuromodulatory Systems Signal? (component II)
Measuring Analogous Signals in Agent-Environment Interaction (component III)
Finding Functional Forms Back to Hyper-parameters (component IV)
Doya-DaYu Agent
Multi-Armed Bandit Experiment
Feedback to Neuroscience Experiments
Discussion
Proposals for Framework
Epistemic & Aleatoric Uncertainty
Uncertainty Estimation Methods
Experimental Details
...and 8 more sections

Figures (7)

Figure 1: Framework for using understanding of neuromodulatory systems to design adaptive RL algorithms. (left) The framework consists of four components: (I) a hypothesised connection between a neuromodulatory system and an RL algorithm hyper-parameter, i.e. what does the neuromodulator do?; (II) an understanding of what quantity the neuromodulator is believed to signal or measure; (III) an analogous quantity measurable as part of the artificial agent-environment interaction; (IV) a functional form mapping this estimate to a hyper-parameter value. (right) An example, with details given in the main text, of the framework applied to (I) a hypothesised connection between Noradrenaline (NA) and exploratory behaviour; (II) where NA is believed to signal unexpected uncertainty; (III) which can be understood as epistemic uncertainty and estimated with ensemble methods; (IV) providing a functional mapping to the inverse temperature hyper-parameter to modulate the exploration-exploitation trade-off.
Figure 2: $Q$-learning update and softmax policy labelled by the mapping between neuromodulators and hyperparameters proposed by doya2002metalearning.
Figure 3: Multi-Armed Bandit Evaluation. 210 seeds; $k=5$, $p=0.4$, $M=500$, $\mu_{\min}=-5$, $\mu_{\max}=5$, $\sigma_{\min}=0.001$, $\sigma_{\max}=2$. Dark lines show the mean; background traces show a random subset of individual seeds. A (top) Schematic of the $k$-armed bandit: the agent must choose one of $k$ arms with normally distributed payouts. (bottom) Example payout distributions in our multi-armed bandit task. Each colour represents a different arm. Solid/dashed lines show distributions before/after a context switch. Agents must model these context switches and explore to find the new high-payout arm, while also navigating changing payout variances. B Average regret per context; while regret is initially higher in new contexts for Doya-DaYu, its final performance is superior. (inset) Final average regret box plot; braces marked *, **, *** indicate $p<0.1$, $p<0.01$, and $p<0.001$, estimated via an initial one-way ANOVA followed by a Tukey HSD test. C, D Learning rates and temperatures adapt to spike at context switches and then decay. E Fraction of arm pulls that are optimal, averaged over all context switches. On average, the full oracle Doya-DaYu agent selects the best arm significantly more often than baseline models by the end of contexts. F Cumulative regret is lower for Doya-DaYu agents than for Discounted-UCB and Boltzmann policies, especially as the number of context switches increases and the space of payoff distributions encountered widens. G Mean squared errors for value estimates over the course of learning. Different levels of exploration lead to different distributions of errors over the arms; having low estimation error on sub-optimal arms is not directly beneficial. It does, however, suggest that baseline algorithms are over-exploring in this task, while the full oracle has high average MSE over all arms. Further results and implementation details, including procedures for hyper-parameter optimisation in the baselines, can be found in the Experimental Details appendix.
Figure 4: Neuromodulation Framework as an Experimental Design Paradigm. In the exploratory branch (starting in the top-left), machine learning algorithms are proposed on the basis of experimental observations along the ($\mathcal{T}_{\text{BIO}}$, $\Theta_{\text{BIO}}$, $\mathcal{M}_{\text{BIO}}$) axes. The confirmatory branch (starting in the bottom-right) is a theory-based experimental pipeline in which a model of animal learning, $\Theta_{\text{AI}}$, gives rise to a theory of the brain (see the appendix for discussion of 'other inputs'). The theory can be verified by mapping experiments from the $\text{AI}$ axes to the $\text{BIO}$ axes via our neuromodulatory framework.
Figure 5: Concrete Experimental Proposal. (a) Proposed task to mirror our multi-armed bandit task in the Multi-Armed Bandit Experiment section; on each trial, following a 'go' cue, mice lick from one of two ports, after which they are given water reward sampled from some underlying distribution for the port selected. Upon reaching proficiency in selecting the more rewarded port, a new trial 'block' with new distributions begins. This paradigm is similar to those used e.g. by duan2021cortico and sabatini_two_armed. (b) Experiment timeline: the experiment is divided into blocks of trials where one of the two arms has a higher expected reward $\mu$. The stimulation protocol is designed such that stimulation epochs occur through a reversal period, and are counter-balanced across side reversals. (c) Trial timeline: phasic stimulation is applied from 'go' cue to after port choice to probe the influence of manipulating NA levels on behaviour. (d) Neural stimulation/recording schematic: an optic fiber over Locus Coeruleus (LC) enables optical inhibition/excitation of NA neurons; the fiber over Ventral Tegmental Area (VTA) allows for confirming changes in NA concentration upon laser stimulation; and the electrode in striatum, believed to encode action values for decision-making, enables recording of striatal activity throughout the task.
...and 2 more figures
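
The example in Figure 1 (right) and the task in Figure 3 can be illustrated with a small sketch. The code below is not the Doya-DaYu agent from the paper; it is a minimal illustration, under stated assumptions, of components III and IV: disagreement within an ensemble of value estimates stands in for epistemic uncertainty, and a hypothetical functional form maps that estimate to the inverse temperature of a softmax policy. The payout ranges follow the Figure 3 caption; everything else (the fixed context length, the ensemble size, the 0.5 update mask, the mapping `beta_max / (1 + uncertainty)`, and all function names) is an assumption made for illustration.

```python
import math
import random
import statistics


def softmax(values, beta):
    """Softmax (Boltzmann) policy with inverse temperature beta."""
    top = max(values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def new_context(rng, k):
    """Draw fresh payout means and standard deviations for every arm."""
    means = [rng.uniform(-5.0, 5.0) for _ in range(k)]
    stds = [rng.uniform(0.001, 2.0) for _ in range(k)]
    return means, stds


def run_bandit(n_steps=1000, k=5, context_len=250, n_members=8,
               lr=0.1, beta_max=5.0, seed=0):
    """Run a non-stationary k-armed bandit with an uncertainty-modulated policy."""
    rng = random.Random(seed)
    means, stds = new_context(rng, k)
    # Ensemble of value tables with different random initialisations.
    ensemble = [[rng.gauss(0.0, 1.0) for _ in range(k)]
                for _ in range(n_members)]
    betas, regrets = [], []
    for step in range(n_steps):
        if step > 0 and step % context_len == 0:
            means, stds = new_context(rng, k)  # context switch (assumed periodic)
        # Component III: ensemble disagreement as an epistemic-uncertainty proxy.
        per_arm = [statistics.pstdev([m[a] for m in ensemble]) for a in range(k)]
        uncertainty = sum(per_arm) / k
        # Component IV: hypothetical mapping; more uncertainty gives a lower beta,
        # i.e. a flatter, more exploratory policy.
        beta = beta_max / (1.0 + uncertainty)
        mean_q = [sum(m[a] for m in ensemble) / n_members for a in range(k)]
        arm = rng.choices(range(k), weights=softmax(mean_q, beta))[0]
        reward = rng.gauss(means[arm], stds[arm])
        for member in ensemble:
            if rng.random() < 0.5:  # bootstrap-style mask keeps members diverse
                member[arm] += lr * (reward - member[arm])
        betas.append(beta)
        regrets.append(max(means) - means[arm])  # expected regret of this pull
    return betas, regrets


if __name__ == "__main__":
    betas, regrets = run_bandit()
    print("mean inverse temperature:", sum(betas) / len(betas))
    print("cumulative regret:", sum(regrets))
```

Other functional forms for component IV, or other uncertainty estimators for component III, could be substituted without changing the structure of the loop.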
