Self-Assembly of a Biologically Plausible Learning Circuit

Qianli Liao, Liu Ziyin, Yulu Gan, Brian Cheung, Mark Harnett, Tomaso Poggio

TL;DR

We propose a circuit for updating the weights in a network that is biologically plausible, works as well as backpropagation, and leads to verifiable predictions about the anatomy and physiology of a characteristic motif of four plastic synapses between ascending and descending cortical streams.

Abstract

Over the last four decades, the amazing success of deep learning has been driven by the use of Stochastic Gradient Descent (SGD) as the main optimization technique. The default implementation for the computation of the gradient for SGD is backpropagation, which, with its variations, is used to this day in almost all computer implementations. From the perspective of neuroscientists, however, the consensus is that backpropagation is unlikely to be used by the brain. Though several alternatives have been discussed, none is so far supported by experimental evidence. Here we propose a circuit for updating the weights in a network that is biologically plausible, works as well as backpropagation, and leads to verifiable predictions about the anatomy and the physiology of a characteristic motif of four plastic synapses between ascending and descending cortical streams. A key prediction of our proposal is a surprising property of self-assembly of the basic circuit, emerging from initial random connectivity and heterosynaptic plasticity rules.

Paper Structure

This paper contains 21 sections, 5 theorems, 29 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Consider a ReLU neural network with an arbitrary width and depth and with an overparametrized downstream pathway. If both $V^\ell$ and $\bar{V}^\ell$ are full-rank for all $\ell$, then, for any $x$ such that for all $\ell \in [L]$, $\Delta V^\ell = O(\epsilon)$ and $\Delta \bar{V}^\ell = O(\epsilon)$ for a positive definite matrix $H = \bar{V}^\ell (\bar{V}^\ell)^T$, and $\Delta_{\rm sgd} W^\ell = \dots$
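
The role of the positive definite matrix $H = \bar{V}^\ell (\bar{V}^\ell)^T$ can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: it assumes that $\bar{V}^\ell$ has shape (upstream width, downstream width), draws it at random, and uses a random matrix as a stand-in for an SGD update of $W^\ell$. The names `matrix_learning_rate`, `is_positive_definite`, `d_up` and `d_down` are hypothetical. It checks that a full-rank $\bar{V}^\ell$ with a wider downstream pathway gives a positive definite $H$, so that multiplying an update by $H$ keeps it positively aligned with the original direction.

```python
import numpy as np


def matrix_learning_rate(v_bar):
    """Return H = V_bar @ V_bar.T for a matrix of shape (d_up, d_down)."""
    return v_bar @ v_bar.T


def is_positive_definite(h, tol=1e-10):
    """Check that the smallest eigenvalue of the symmetric matrix h exceeds tol."""
    eigvals = np.linalg.eigvalsh(h)
    return bool(eigvals.min() > tol)


rng = np.random.default_rng(0)

# Assumed sizes: downstream pathway 7.5 times wider than the upstream one.
d_up, d_down = 8, 60

# Random V_bar; a Gaussian matrix of this shape is full-rank with probability 1.
v_bar = rng.standard_normal((d_up, d_down))
H = matrix_learning_rate(v_bar)

# Stand-in for an SGD update of W (assumption: rows are indexed by upstream units).
sgd_update = rng.standard_normal((d_up, 5))

# Update rescaled by the matrix learning rate H.
scaled_update = H @ sgd_update

# Inner product <sgd_update, H sgd_update> = trace(U^T H U), positive when H is PD.
alignment = float(np.sum(sgd_update * scaled_update))

print("H positive definite:", is_positive_definite(H))
print("alignment > 0:", alignment > 0)
```

Because the inner product between the rescaled update and the original one is positive, the rescaled update remains a descent direction whenever the original one is, which is the sense in which $H$ acts as a matrix-valued learning rate.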

Figures (5)

  • Figure 1: A scheme of the upstream-downstream synaptic motif. A: The overall scheme for the upstream-downstream architecture. The upstream pathway consists of a standard fully-connected neural network with multiple layers, possibly corresponding to a multi-region processing pathway in the cortex like V1-V2-V4-IT. The output of the upstream network goes to an error processing module (possibly corresponding to PFC in the brain). The error module computes a local error signal that can be used to immediately train the last layer. This error signal is also sent to the feedback (downstream) pathway, which processes information layer by layer downwards. The black dashed arrows represent non-learnable (identity) connections. Red solid arrows represent learnable connections, each parameterized by a fully-connected weight matrix. B: A biological sketch of the smallest unit of the connection motif of A. The neurons in B (as well as the abstract forms in A and C) are illustrated as pyramidal neurons, which are common in the cortex. Pyramidal neurons typically have axons extending from the base of the cell body, while their dendrites can grow from the top (apical dendrites) or bottom (basal dendrites). Only apical dendrites are illustrated here for simplicity. C: A mathematical description of this unit. Each arrow (and the corresponding axons, dendrites and synapses) represents a set of full connections between two groups of neurons, parameterized by a weight matrix $W$ or $V$. Every $h$ in C is a vector and refers to the activations of a group of neurons. The connection matrix $V$ allows the upstream and downstream networks to have different numbers of hidden units in corresponding layers.
  • Figure 2: The impact of the width of the downstream pathway on performance. We use different activation functions in the downstream pathway. The width of the upstream pathway was kept constant at 200, while the width of the downstream pathway was varied. We find that a wider downstream network improves the performance of the feedforward network up to an overparametrization ratio of $7.5$. In biology, the ratios can also depend on biological constraints such as energy consumption or wiring volume and may be region-dependent.
  • Figure 3: The matrices $\bar{V}\bar{V}^T$ (upper) and $WW^T$ (lower) before and after training. Also, recall that $\bar{V}\bar{V}^T=H$ is the matrix learning rate for $W$ (Theorem 1). The neurons within the same layer become correlated after training. Interestingly, the spectra of the weight matrices of the two pathways are found to be effectively low-rank (right). This visualization was done using the SVHN dataset with a network having 8 hidden layers.
  • Figure 4: Ablation study on the roles of $\bar{V}$, $\bar{W}$ and $V$ on CIFAR-10. Here, we make a subset of the interconnections non-trainable. The legend labels indicate trainable components. "Linear" refers to a linear network trained with SGD. We see that (1) making $\bar{V}$ plastic is more important than making other connections plastic, and (2) making everything plastic significantly improves the performance further.
  • Figure 5: Performance of a four-layer MLP for different choices of learning rates. Interestingly, the best performance is achieved when $\eta_V$ is the smallest.

Theorems & Definitions (11)

  • Theorem 1
  • Remark
  • Lemma 1
  • proof
  • Definition 1
  • Lemma 2
  • proof
  • Theorem 2
  • proof
  • Proposition 1
  • ...and 1 more