Forward Learning with Top-Down Feedback: Empirical and Analytical Characterization

Ravi Srinivasan, Francesca Mignacco, Martino Sorbaro, Maria Refinetti, Avi Cooper, Gabriel Kreiman, Giorgia Dellaferrera

TL;DR

This work unveils the connections between three key neuro-inspired learning rules, providing a link between forward-only algorithms, i.e., Forward-Forward and PEPITA, and an approximation of backpropagation, i.e., Feedback Alignment.

Abstract

"Forward-only" algorithms, which train neural networks while avoiding a backward pass, have recently gained attention as a way of addressing the biologically unrealistic aspects of backpropagation. Here, we first address compelling challenges related to the "forward-only" rules, which include reducing the performance gap with backpropagation and providing an analytical understanding of their dynamics. To this end, we show that the forward-only algorithm with top-down feedback is well-approximated by an "adaptive-feedback-alignment" algorithm, and we analytically track its performance during learning in a prototype high-dimensional setting. Then, we compare different versions of forward-only algorithms, focusing on the Forward-Forward and PEPITA frameworks, and we show that they share the same learning principles. Overall, our work unveils the connections between three key neuro-inspired learning rules, providing a link between "forward-only" algorithms, i.e., Forward-Forward and PEPITA, and an approximation of backpropagation, i.e., Feedback Alignment.

Paper Structure (28 sections, 27 equations, 11 figures, 6 tables, 3 algorithms)


Figures (11)

  • Figure 1: Different error-transport schemes and weight-mirroring (WM) configurations. Green arrows mark forward paths and orange arrows indicate error paths. (a) Feedback alignment (FA). (b) Present the Error to Perturb the Input To modulate Activity (PEPITA). (c) PEPITA with WM. (d) Forward-Forward (FF).
  • Figure 2: (a) Test accuracy as a function of epochs on the MNIST dataset for a 1-hidden-layer network (1024 hidden units, ReLU). Blue dots mark the "vanilla" PEPITA algorithm (without momentum), while purple crosses mark the AFA approximation (Eq. \ref{eq:AF-layer1}). (b) Generalization error as a function of time for the experiments with PEPITA (blue dots) and AFA (purple crosses), with the theoretical curves (App. \ref{appendix:ODEs}) shown as solid lines. Parameters: $D=500$, learning rate $0.05$, erf activation, 2 hidden units in both teacher and student. (c) Alignment angle between the teacher and student second-layer weights (dark green), the student and a degenerate solution (light green), and the AF matrix and the student (orange) as a function of time. (d) Directions of the student, the AF matrix, and the teacher (and degenerate solutions). The snapshot times are marked by vertical dashed lines in panel (c).
  • Figure 3: (a), (b) Test accuracy of fully connected networks with increasing depth trained with PEPITA on the CIFAR-10 dataset. (a) "PEPITA uniform" and "PEPITA normal" refer to the initialization of the weights and $F$ (Sec. \ref{sec:deeper}). "PEPITA-Hebbian" refers to the learning rule explained in Sec. \ref{sec:hebbian}. (b) Effect of weight decay. (c) Alignment angle between $F$ (Eq. \ref{eq:modulatedpass}) and $W_1 \cdot W_2$ during training with or without WM. PreM refers to pre-mirroring (Sec. \ref{sec:deeper}). Hyperparameters are reported in Table \ref{tab:architectures}. The plots indicate mean and standard deviation over 10 independent runs.
  • Figure S1: Test curve for PEPITA in its time-local formulation and for time-local PEPITA with $F=0$ (i.e., only the last layer is trained) on the CIFAR-10 dataset. The network has 1 hidden layer with 1024 units. The forward matrices are initialized using the He normal initialization. The entries of $F$ are sampled from a normal distribution with standard deviation $0.5\cdot 2\sqrt{6/(32\cdot 32\cdot 3)}$. We use learning rate $0.0001$ and weight decay with $\lambda=10^{-4}$. The learning rate is multiplied by $0.1$ at epoch 50. The plot indicates mean and standard deviation over 10 independent runs. Time-local PEPITA achieves a significantly higher accuracy than the time-local $F=0$ scheme.
  • Figure S2: Difference of the norm of the squared activities of the first hidden layer between the clean and modulated pass in PEPITA (a) before training, (b) after 50 epochs, and (c) at the end of training. The network is a 2-hidden-layer network trained with weight decay (WD) with $\lambda=10^{-4}$ on the CIFAR-10 dataset. The activities are recorded on the test set. In PEPITA, the input of the second forward pass is modulated by the error. Since the error decreases during training, the difference between the activations in the two passes also decreases. This explains why the distribution of the difference of the norm of the squared activities has a lower standard deviation in the middle of training (b) and at the end of training (c) than before training (a). In contrast, the modulation of the input in FF is constant during training, and the goal of training is to maximize the difference of the goodness between the two passes.
  • ...and 6 more figures
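
The quantities mentioned in the captions of Figures S1 and S2 can be illustrated with a minimal sketch in plain Python. It is not the authors' code. It assumes a 1-hidden-layer ReLU network without biases, assumes the sign convention $x + Fe$ for the modulated input (the captions only state that the second pass is modulated by the error), and reads "difference of the norm of the squared activities" as the difference of summed squared activations. The function names `f_init_std` and `two_pass_activity_gap` are hypothetical.

```python
import math


def f_init_std(input_dim=32 * 32 * 3, scale=0.5):
    """Standard deviation of the entries of F quoted in the Figure S1 caption:
    scale * 2 * sqrt(6 / input_dim), with CIFAR-10 input size by default."""
    return scale * 2.0 * math.sqrt(6.0 / input_dim)


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def two_pass_activity_gap(W1, W2, F, x, target):
    """Run a clean pass and an error-modulated pass; return the difference of
    the summed squared hidden activities (clean minus modulated) and the error.

    Assumption: the modulated input is x + F e (sign convention not taken
    from the source), where e = output - target.
    """
    h_clean = relu(matvec(W1, x))             # first (clean) forward pass
    out = matvec(W2, h_clean)
    e = [o - t for o, t in zip(out, target)]  # output error
    x_mod = [xi + d for xi, d in zip(x, matvec(F, e))]
    h_mod = relu(matvec(W1, x_mod))           # second (modulated) forward pass
    sq_clean = sum(a * a for a in h_clean)
    sq_mod = sum(a * a for a in h_mod)
    return sq_clean - sq_mod, e


if __name__ == "__main__":
    W1 = [[1.0, 0.0], [0.0, 1.0]]
    W2 = [[1.0, 1.0]]
    F = [[1.0], [0.0]]
    print(f_init_std())
    print(two_pass_activity_gap(W1, W2, F, [1.0, 2.0], [2.0]))
```

When the error is zero the two passes coincide and the gap vanishes, which matches the caption's observation that the difference between the passes shrinks as the error decreases during training.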