Temporal-Difference Learning Using Distributed Error Signals

Jonas Guan, Shon Eduard Verch, Claas Voelcker, Ethan C. Jackson, Nicolas Papernot, William A. Cunningham

TL;DR

A new deep Q-learning algorithm, Artificial Dopamine, is designed to computationally demonstrate that synchronously distributed, per-layer TD errors may be sufficient to learn surprisingly complex RL tasks.

Abstract

A computational problem in biological reward-based learning is how credit assignment is performed in the nucleus accumbens (NAc). Much research suggests that NAc dopamine encodes temporal-difference (TD) errors for learning value predictions. However, dopamine is synchronously distributed in regionally homogeneous concentrations, which does not support explicit credit assignment (such as that used by backpropagation). It is unclear whether distributed errors alone are sufficient for synapses to make coordinated updates to learn complex, nonlinear reward-based learning tasks. We design a new deep Q-learning algorithm, Artificial Dopamine, to computationally demonstrate that synchronously distributed, per-layer TD errors may be sufficient to learn surprisingly complex RL tasks. We empirically evaluate our algorithm on MinAtar, the DeepMind Control Suite, and classic control tasks, and show it often achieves comparable performance to deep RL algorithms that use backpropagation.

Paper Structure

This paper contains 34 sections, 2 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Simplified illustration of dopamine distribution in the NAc. Dopamine is synthesized in the VTA and transported along axons to the NAc, where it is picked up by receptors in medium spiny neurons. Dopamine concentrations (error signals) are locally homogeneous, but can vary across regions. Connections between NAc neurons are not shown.
  • Figure 2: Network architecture of a 3-layer AD network. $h_{t}^{[l]}$ represents the activations of layer $l$ at time $t$, and $s_{t}$ the input state. The blocks are AD cells, as shown in Figure 3. Similar to how dopamine neurons compute and distribute error used by a local region, each cell computes its own local TD error used by its updates; errors do not propagate across layers. To relay information, upper layers send activations to lower layers in the next timestep. For example, red shows all active connections at $t=1$.
  • Figure 3: Inner workings of our proposed AD cell (i.e., hidden layer). $h_{t}^{[l]}$ is the activations of the cell $l$ at time $t$, and $\hat{Q}_{t}^{[l]}$ is a vector of Q-value predictions given the current state and each action. We compute the cell's activations $h_{t}^{[l]}$ using a ReLU weight layer, then use an attention-like mechanism to compute $\hat{Q}_{t}^{[l]}$. Specifically, we obtain $\hat{Q}_{t}^{[l]}$ by having the cell's tanh weight layers, one for each action, compute attention weights that are then applied to $h_{t}^{[l]}$. Each cell computes its own error.
  • Figure 4: Episodic returns of AD in MinAtar and DMC environments, compared to DQN, TD-MPC2 and SAC. Lines show the mean return over 10 seeds and the shaded area corresponds to 3 standard errors. The axes are return and environment steps.
  • Figure 5: Ablation study comparing the performance of AD against AD without the forward-in-time connections, and a single-layer AD cell. In Seaquest and Asterix, AD achieves qualitatively stronger performance. In Seaquest the line for AD single layer is overlapped by the line for AD no forward.
  • ...and 5 more figures
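
The AD cell described in the Figure 3 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ADCellSketch`, the helper `local_td_error`, the weight initialization, and the choice to feed the same input vector `x` to both the ReLU layer and the per-action tanh layers are all hypothetical. The TD error shown is the standard one-step Q-learning form, which the captions do not spell out.

```python
import numpy as np


class ADCellSketch:
    """Hypothetical sketch of one AD cell (hidden layer), following the Figure 3 caption."""

    def __init__(self, input_dim, hidden_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(input_dim)  # assumed initialization scale
        # ReLU weight layer that produces the cell's activations h.
        self.W_h = rng.normal(0.0, scale, (hidden_dim, input_dim))
        self.b_h = np.zeros(hidden_dim)
        # One tanh weight layer per action, producing attention weights over h.
        self.W_att = rng.normal(0.0, scale, (num_actions, hidden_dim, input_dim))
        self.b_att = np.zeros((num_actions, hidden_dim))

    def forward(self, x):
        """Return (h, q): activations and one Q-value prediction per action."""
        h = np.maximum(0.0, self.W_h @ x + self.b_h)   # shape (hidden_dim,)
        att = np.tanh(self.W_att @ x + self.b_att)     # shape (num_actions, hidden_dim)
        q = att @ h                                    # shape (num_actions,)
        return h, q


def local_td_error(q, action, reward, q_next, gamma=0.99, done=False):
    """Standard one-step Q-learning TD error, computed from one cell's own predictions."""
    bootstrap = 0.0 if done else gamma * float(np.max(q_next))
    return reward + bootstrap - float(q[action])


if __name__ == "__main__":
    cell = ADCellSketch(input_dim=4, hidden_dim=8, num_actions=3)
    h, q = cell.forward(np.ones(4))
    h_next, q_next = cell.forward(np.zeros(4))
    delta = local_td_error(q, action=0, reward=1.0, q_next=q_next)
    print(h.shape, q.shape, delta)
```

In a multi-layer network, each cell would compute its own `local_td_error` from its own `q` and update only its own weights, so no error crosses layer boundaries. How the input `x` is assembled from the layer below and from the layer above at the previous timestep is left out of this sketch.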