Backpropagation through space, time, and the brain

Benjamin Ellenberger; Paul Haider; Jakob Jordan; Kevin Max; Ismael Jaras; Laura Kriener; Federico Benitez; Mihai A. Petrovici

TL;DR

This work introduces Generalized Latent Equilibrium (GLE), an energy-based framework for fully local spatio-temporal credit assignment in continuous-time neural networks. By defining neuron-local mismatches and evolving both neuronal states and synaptic parameters to minimize a global energy, GLE yields a real-time, online approximation of backpropagation through space and time that remains local in both. The framework combines retrospective (memory-like) and prospective (prediction-like) processing, organized in a cortex-compatible microcircuit with distinct representation and error streams. It is evaluated on online MNIST-1D and speech-command classification, on purely spatial CIFAR-10, and on chaotic Mackey-Glass time-series prediction. The results point to GLE as a principled account of spatio-temporal inference in the brain and as a candidate for neuromorphic hardware, where online, local learning across diverse time scales is advantageous.

Abstract

How physical networks of neurons, bound by spatio-temporal locality constraints, can perform efficient credit assignment remains, to a large extent, an open question. In machine learning, the answer is almost universally given by the error backpropagation algorithm, through both space and time. However, this algorithm is well-known to rely on biologically implausible assumptions, in particular with respect to spatio-temporal (non-)locality. Alternative forward-propagation models such as real-time recurrent learning only partially solve the locality problem, and only at the cost of scaling, due to prohibitive storage requirements. We introduce Generalized Latent Equilibrium (GLE), a computational framework for fully local spatio-temporal credit assignment in physical, dynamical networks of neurons. We start by defining an energy based on neuron-local mismatches, from which we derive both neuronal dynamics via stationarity and parameter dynamics via gradient descent. The resulting dynamics can be interpreted as a real-time, biologically plausible approximation of backpropagation through space and time in deep cortical networks with continuous-time neuronal dynamics and continuously active, local synaptic plasticity. In particular, GLE exploits the morphology of dendritic trees to enable more complex information storage and processing in single neurons, as well as the ability of biological neurons to phase-shift their output rate with respect to their membrane potential, which is essential in both directions of information propagation. For the forward computation, it enables the mapping of time-continuous inputs to neuronal space, effectively performing a spatio-temporal convolution. For the backward computation, it permits the temporal inversion of feedback signals, which consequently approximate the adjoint variables necessary for useful parameter updates.
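
As a rough illustration of what "an energy based on neuron-local mismatches" with "parameter dynamics via gradient descent" can look like, the following is a sketch only. The membrane potential $u_i$, the weight matrix $W_i$, and the specific form of the mismatch $e_i$ are assumptions introduced here for illustration and may differ from the paper's exact definitions; $\mathcal{D}^{+}_{\tau}$ is the prospective operator named in the figure captions.

```latex
E(t) = \frac{1}{2} \sum_i \lVert e_i(t) \rVert^2,
\qquad
e_i = \mathcal{D}^{+}_{\tau^{\mathrm{m}}_i}\{u_i\} - W_i\, r_{i-1},
\qquad
\dot W_i \;\propto\; -\frac{\partial E}{\partial W_i} = e_i\, r_{i-1}^{\top}.
```

Under these assumptions, gradient descent on $E$ gives a weight update that is a product of a local error and a presynaptic rate, which matches the form $\dot w \propto e\, r$ used for synaptic updates in Figure 1.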

Paper Structure (50 sections, 71 equations, 17 figures, 3 tables, 2 algorithms)

Introduction
Results
The GLE framework
Network dynamics
GLE dynamics implement a real-time approximation of AM/BPTT
Cortical / neuromorphic circuits
A minimal GLE example
Small GLE networks
Challenging spatio-temporal classification
MNIST-1D
GLE for purely spatial problems
Scaling, noise and symmetry
Chaotic time series prediction
Discussion
Connection to related approaches
...and 35 more sections

Figures (17)

Figure 1: The problem of locality in spatio-temporal credit assignment. (a) To illustrate the different learning algorithms, we consider three neurons within a larger recurrent network. The neuron indices are indicative of the distance from the output, with neuron $i+1$ being itself an output neuron, and therefore having direct access to an output error $e_{i+1}$. (b) Information needed by a deep synapse at time $t$ to calculate an update $\dot w_{i-1,k}^{(t)}$. Orange: future-facing algorithms such as BPTT require the states $r_{n}^{(t^{+})}$ of all future times $t^+$ and all neurons $n$ in the network and can therefore only be implemented in an offline fashion. These states are required to calculate future errors $e_n^{(t^+)}$, which are then propagated back in time into present errors $e_n^{(t)}$ and used for synaptic updates $\dot w_{i-1,k}^{(t)} \propto e_{i-1}^{(t)} r_k^{(t)}$. Purple: past-facing algorithms such as RTRL store past effects of all synapses $w_{jk}^{(t^{-})}$ on all past states $r_n^{(t^-)}$ in an influence tensor $M_{n,j,k}^{(t^-)}$. This tensor can be updated online and used to perform weight updates $\dot w_{i-1,k}^{(t)} \propto \sum_{n} e_{n}^{(t)} M_{n,i-1,k}^{(t)}$. Note that all synapse updates need to have access to distant output errors. Furthermore, the update of each element in the influence tensor requires the knowledge of distant elements and is thus itself nonlocal in space. Green: GLE operates exclusively on present states $r_n^{(t)}$. It uses them to infer errors $e_n^{(t)}$ that approximate the future backpropagated errors of BPTT.
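
The two update rules quoted in the Figure 1 caption can be contrasted in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the function names, the random placeholder values, and the assumption of a fully recurrent network of `n` neurons (so that the influence tensor has shape `(n, n, n)`) are all illustrative choices made here.

```python
import numpy as np

def rtrl_weight_update(errors, influence):
    """RTRL-style update: dw[j, k] = sum_n errors[n] * influence[n, j, k].

    Every synapse (j, k) needs the errors of all neurons n, so the rule is
    nonlocal in space.
    """
    return np.einsum("n,njk->jk", errors, influence)

def local_weight_update(errors, rates):
    """Local update: dw[j, k] = errors[j] * rates[k].

    Each synapse only uses the error of its postsynaptic neuron and the
    rate of its presynaptic neuron.
    """
    return np.outer(errors, rates)

def influence_tensor_entries(num_neurons):
    """Entries of the influence tensor M[n, j, k] in a fully recurrent
    network (assumption): one entry per (neuron, synapse) pair."""
    return num_neurons ** 3

# Placeholder values, purely for illustration.
rng = np.random.default_rng(0)
n = 4
errors = rng.normal(size=n)
rates = rng.normal(size=n)
influence = rng.normal(size=(n, n, n))

dw_rtrl = rtrl_weight_update(errors, influence)
dw_local = local_weight_update(errors, rates)
print(dw_rtrl.shape, dw_local.shape)
print(influence_tensor_entries(1000))
```

Both rules return one value per synapse, but the RTRL-style rule has to carry a tensor that grows cubically with the number of neurons under the stated assumption, which is one way to read the "prohibitive storage requirements" mentioned in the abstract.
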
Figure 2: Comparison between AM/BPTT and GLE. Network dynamics define trajectories (black) in the cost/energy landscape, spanned by external inputs $I$ and neuron outputs $r$. Parameter updates (red, here: synaptic weights) reduce the cost/energy along these trajectories. (a) AM/BPTT records the trajectory between two points in time and calculates the total update $\Delta W$ that reduces the integrated cost along this trajectory. (b) GLE calculates an approximate cost gradient at every point in time, by taking into account past network states (via retrospective coding, $\mathcal{I}^{-}_{\tau}$) and estimating future errors from the current state (via prospective coding, $\mathcal{D}^{+}_{\tau}$). Learning is thus fully online and can gradually reduce the energy in real time, with the (real) trajectory slowly dropping away from the (virtual) trajectory of a network that is not learning (dashed line).
Figure 3: Comparison of GLE and AM/BPTT in Fourier space. (a) Effect of the individual and combined GLE operators $\mathcal{I}^{-}_{\tau}$ and $\mathcal{D}^{+}_{\tau}$ with shared time constant $\tau$ on a single frequency component of an input current $I$. $\mathcal{I}^{-}_{\tau}$ generates a negative phase shift (towards later times) and sub-unit gain. $\mathcal{D}^{+}_{\tau}$ is its exact inverse and generates a positive phase shift (towards earlier times) and supra-unit gain. (b) Phase shift and (c) gain of all four temporal operators in GLE and AM/BPTT across a wide range of the frequency spectrum. Note how the prospective operators $\mathcal{D}^{+}_{\tau}$ and $\mathcal{I}^{+}_{\tau}$ (orange) have the same shift but inverse gain; the same holds for the retrospective operators $\mathcal{I}^{-}_{\tau}$ and $\mathcal{D}^{-}_{\tau}$ (purple). (d) Phase shift and (e) gain of the combined operators as they appear in the neuron dynamics. Here, we choose an example forward neuron (blue) with a retrospective attention window ($\tau^\mathrm{m} = 10 \tau^\mathrm{r}$). Both the associated errors $e$ (blue) and the adjoint variables $\lambda$ (dotted) are prospective and precisely invert this phase shift, albeit with a different gain.
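
The phase and gain relations described in the Figure 3 caption can be checked numerically. The sketch below assumes that the differential operators act as $\mathcal{D}^{\pm}_{\tau} = 1 \pm \tau\, \mathrm{d}/\mathrm{d}t$, that $\mathcal{I}^{-}_{\tau}$ and $\mathcal{I}^{+}_{\tau}$ are the inverses of $\mathcal{D}^{+}_{\tau}$ and $\mathcal{D}^{-}_{\tau}$ respectively, and that a frequency component has the form $e^{i\omega t}$. These forms are not stated in the text above and are assumptions; the function names are hypothetical.

```python
import numpy as np

def transfer_functions(omega, tau):
    """Fourier-space transfer functions of the four temporal operators.

    Assumed forms: D+- = 1 +- tau * d/dt acting on exp(i * omega * t),
    with I- the inverse of D+ and I+ the inverse of D-.
    """
    iwt = 1j * np.asarray(omega) * tau
    return {
        "D+": 1 + iwt,        # prospective, supra-unit gain
        "I-": 1 / (1 + iwt),  # retrospective, sub-unit gain
        "D-": 1 - iwt,        # retrospective, supra-unit gain
        "I+": 1 / (1 - iwt),  # prospective, sub-unit gain
    }

def gain_and_phase(h):
    """Return (gain, phase in radians) of a complex transfer function."""
    return np.abs(h), np.angle(h)

# Evaluate over several decades of frequency for a unit time constant.
omega = np.logspace(-2, 2, 5)
tau = 1.0
ops = transfer_functions(omega, tau)
for name, h in ops.items():
    gain, phase = gain_and_phase(h)
    print(name, np.round(gain, 3), np.round(phase, 3))
```

Under these assumptions, the product of the `"D+"` and `"I-"` transfer functions is exactly 1 at every frequency, `"D+"` and `"I+"` share the same positive phase with reciprocal gains, and `"I-"` and `"D-"` share the same negative phase with reciprocal gains, in line with panels (a) to (c).
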
Figure 4: Microcircuit implementation of GLE: key components. Representation neurons form the forward pathway (red), error neurons form the backward pathway (blue). Both classes of neurons are likely located in different layers of cortex. Lateral connections enable information exchange and gating between the two streams. The combination of retrospective membrane and prospective output dynamics allows these neurons to tune the temporal shift of the transmitted information. Errors are also represented in dendrites, likely located in the apical tuft of signal neurons, enabling local three-factor plasticity to correct the backpropagated errors.
Figure 5: Learning with GLE in a simple chain. (a) Network setup. A chain of two retrospective representation neurons (red) learns to mimic the output of a teacher network (identical architecture, different parameters). In GLE, this chain is mirrored by a chain of corresponding error neurons (blue), following the microcircuit template in Figure 4. We compare the effects of three learning algorithms: GLE (green), backpropagation (BP) with instantaneous errors (purple) and BPTT (point markers denote the discrete nature of the algorithm; pink, brown and orange denote different truncation windows (TW)). (b) Output of representation neurons ($r_i$, red) and error neurons ($e_i$, blue) for GLE and instantaneous BP. Left: before learning (i.e., both weights and membrane time constants are far from optimal). Right: after learning. (c) Evolution of weights, time constants and overall loss. Fluctuations at the scale of $10^{-10}$ are due to limits in the numerical precision of the simulation.
...and 12 more figures
