Scaling Forward Gradient With Local Losses

Mengye Ren; Simon Kornblith; Renjie Liao; Geoffrey Hinton

TL;DR

Forward gradient learning is explored as a biologically plausible alternative to backpropagation. To address the high variance of the estimator in high-dimensional settings, the paper perturbs activations rather than weights and introduces many local greedy losses, together with LocalMixer, an MLPMixer-inspired architecture suited to local learning. Local losses are replicated across block, patch, and group dimensions, with feature aggregators and normalization designed so that each local learning signal remains globally informed. Theoretical analysis establishes that the estimators are unbiased and quantifies their variance. In experiments, activity-perturbed forward gradient with local losses matches backprop on MNIST and CIFAR-10 and outperforms prior backprop-free methods on ImageNet, suggesting a scalable route to biologically plausible, model-parallel learning.

Abstract

Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for learning deep neural networks. However, the standard forward gradient algorithm, when applied naively, suffers from high variance when the number of parameters to be learned is large. In this paper, we propose a series of architectural and algorithmic modifications that together make forward gradient learning practical for standard deep learning benchmark tasks. We show that it is possible to substantially reduce the variance of the forward gradient estimator by applying perturbations to activations rather than weights. We further improve the scalability of forward gradient by introducing a large number of local greedy loss functions, each of which involves only a small number of learnable parameters, and a new MLPMixer-inspired architecture, LocalMixer, that is more suitable for local learning. Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
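
The variance claim above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single linear layer `z = W x` with a squared-error loss, Gaussian perturbations, and a closed-form directional derivative in place of a forward-mode JVP. All names (`forward_gradient_demo`, `weight_perturbed`, `activity_perturbed`) are hypothetical.

```python
import random


def forward_gradient_demo(n_in=8, n_out=4, n_samples=4000, seed=0):
    """Compare weight- and activity-perturbed forward gradients on z = W x."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    t = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

    # Forward pass with loss 0.5 * ||z - t||^2, so dL/dz = z - t.
    z = [sum(W[i][j] * x[j] for j in range(n_in)) for i in range(n_out)]
    delta = [z[i] - t[i] for i in range(n_out)]
    # Reference gradient dL/dW = delta x^T, used only to measure error.
    true_grad = [[delta[i] * x[j] for j in range(n_in)] for i in range(n_out)]

    def weight_perturbed():
        # Perturb every weight: n_in * n_out random directions.
        V = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        dz = [sum(V[i][j] * x[j] for j in range(n_in)) for i in range(n_out)]
        dL = sum(delta[i] * dz[i] for i in range(n_out))  # directional derivative
        return [[dL * V[i][j] for j in range(n_in)] for i in range(n_out)]

    def activity_perturbed():
        # Perturb only the n_out pre-activations, then map to weights via x.
        u = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
        dL = sum(delta[i] * u[i] for i in range(n_out))  # directional derivative
        return [[dL * u[i] * x[j] for j in range(n_in)] for i in range(n_out)]

    def summarize(estimator):
        # Return the sample mean and the mean squared error to true_grad.
        mean = [[0.0] * n_in for _ in range(n_out)]
        mse = 0.0
        for _ in range(n_samples):
            g = estimator()
            for i in range(n_out):
                for j in range(n_in):
                    mean[i][j] += g[i][j] / n_samples
                    mse += (g[i][j] - true_grad[i][j]) ** 2 / n_samples
        return mean, mse

    return true_grad, summarize(weight_perturbed), summarize(activity_perturbed)


true_grad, (mean_w, mse_w), (mean_a, mse_a) = forward_gradient_demo()
print(f"weight-perturbed   mean squared error: {mse_w:.2f}")
print(f"activity-perturbed mean squared error: {mse_a:.2f}")
```

Under these assumptions both sample means approach the true gradient, while the squared error of a single estimate grows with the number of perturbed dimensions: `n_in * n_out` for weights versus `n_out` for activations.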

Paper Structure

This paper contains 47 sections, 5 theorems, 31 equations, 9 figures, 6 tables, 2 algorithms.

Introduction
Related Work
Forward gradient and reinforcement learning.
Greedy local learning.
Asymmetric feedback weights.
Biologically plausible perturbation learning.
Forward Gradient Learning
Forward-mode automatic differentiation (AD)
Weight-perturbed forward gradient
Activity-perturbed forward gradient
Theoretical properties
Continuous-time rate-based models
Activation sparsity and normalization functions
Scaling with Local Losses
...and 32 more sections

Key Result

Proposition 1

$g_w(w_{ij})$ is an unbiased gradient estimator if $\{v_{ij}\}$ are independent zero-mean unit-variance random variables (Baydin et al., 2022).
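
A short sketch of why this holds, assuming the weight-perturbed estimator has the usual form $g_w(w_{ij}) = \left(\sum_{i'j'} \nabla w_{i'j'} v_{i'j'}\right) v_{ij}$. This form and the index names $i', j'$ are assumptions here, since the definition is not reproduced in this summary.

$$
\mathbb{E}\left[g_w(w_{ij})\right]
= \sum_{i'j'} \nabla w_{i'j'} \, \mathbb{E}\left[v_{i'j'} v_{ij}\right]
= \nabla w_{ij} \, \mathbb{E}\left[v_{ij}^2\right]
= \nabla w_{ij}
$$

Independence and zero mean make every cross term $\mathbb{E}[v_{i'j'} v_{ij}]$ with $(i', j') \neq (i, j)$ vanish, and unit variance makes the remaining factor equal to 1.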

Figures (9)

Figure 1: A LocalMixer network consists of several mixer blocks. A=Activation function (ReLU).
Figure 2: A LocalMixer residual block with local losses. Token mixing consists of a linear layer and channels are grouped in the channel mixing layers. Layer norm is applied before and after every linear layer. LN=Layer Norm; FC=Fully Connected layer; A=Activation function (ReLU); T=Transpose.
Figure 3: Feature aggregator designs. A) In the conventional design, average pooling is performed to aggregate features from different spatial locations. B) In the proposed replicated design, features are first concatenated across groups and then averaged across spatial locations. Copies of the same feature are created with different stop-gradient masks, yielding many local losses instead of a single global one. The stop-gradient mask ensures that the perturbation in each spatial group corresponds to its own loss function. The numerical value of the loss function is the same as in the conventional design.
Figure 4: Importance of StopGradient in the InfoNCE loss, using M/8 on CIFAR-10 with 256 channels and 1 group.
Figure 5: Memory and compute usage of the naïve and fused implementations of replicated losses.
...and 4 more figures

Theorems & Definitions (11)

Proposition 1
proof
Proposition 2
proof
Lemma 1
proof
Remark
Proposition 3
proof
Proposition 4
...and 1 more
