Stochastic Forward-Forward Learning through Representational Dimensionality Compression

Zhichao Zhu, Yang Qi, Hengyuan Ma, Wenlian Lu, Jianfeng Feng

TL;DR

The paper addresses training neural networks without backpropagation or curated negative samples by extending Forward-Forward learning with a dimensionality-based objective. It introduces effective dimensionality (ED), a second-order statistic, and optimizes a two-term loss that minimizes the ED of noisy copies of each input while maximizing ED across samples. Noise-augmented copies replace explicit negatives, and inference is energy-based, using the mean of squared outputs. Empirical results on MNIST, CIFAR-10, and CIFAR-100 show performance competitive with other non-BP methods, with noise and dimensionality compression both playing crucial roles. The approach offers a biologically plausible, hardware-friendly alternative and relates to self-supervised and predictive-coding paradigms, though scaling to larger models remains an open challenge.

Abstract

The Forward-Forward (FF) learning algorithm provides a bottom-up alternative to backpropagation (BP) for training neural networks, relying on a layer-wise "goodness" function with well-designed negative samples for contrastive learning. Existing goodness functions are typically defined as the sum of squared postsynaptic activations, neglecting correlated variability between neurons. In this work, we propose a novel goodness function termed dimensionality compression that uses the effective dimensionality (ED) of fluctuating neural responses to incorporate second-order statistical structure. Our objective minimizes ED for noisy copies of individual inputs while maximizing it across the sample distribution, promoting structured representations without the need to prepare negative samples. We demonstrate that this formulation achieves competitive performance compared to other non-BP methods. Moreover, we show that noise plays a constructive role that can enhance generalization and improve inference when predictions are derived from the mean of squared outputs, which is equivalent to making predictions based on an energy term. Our findings contribute to the development of more biologically plausible learning algorithms and suggest a natural fit for neuromorphic computing, where stochasticity is a computational resource rather than a nuisance. The code is available at https://github.com/ZhichaoZhu/StochasticForwardForward
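
As a rough illustration of the objective described in the abstract, the sketch below computes ED as a participation ratio, $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, of the uncentered second-moment matrix. This formula is an assumption: it is a common definition of ED and matches the range from 1 to 2 described for two-dimensional responses in Figure 1a, but the paper's own equation is not reproduced here. The function names, the small `eps` stabilizer, and the unweighted difference used as the loss are hypothetical choices made for illustration only.

```python
import numpy as np

def effective_dimensionality(responses, eps=1e-12):
    """Participation-ratio ED of an (n_samples, n_units) response matrix.

    Assumed definition: ED = tr(M)^2 / tr(M^2), where M is the uncentered
    second moment R^T R / n. This equals (sum of eigenvalues)^2 divided by
    the sum of squared eigenvalues of M.
    """
    r = np.asarray(responses, dtype=float)
    m = r.T @ r / r.shape[0]
    return np.trace(m) ** 2 / (np.trace(m @ m) + eps)

def compression_loss(noisy_responses):
    """Hypothetical two-term loss for responses of shape (batch, n_copies, n_units).

    ed_c: mean ED over the noisy copies of each individual input (to minimize).
    ed_d: ED across all responses in the batch (to maximize).
    The unweighted difference is an illustrative choice, not the paper's weighting.
    """
    x = np.asarray(noisy_responses, dtype=float)
    ed_c = np.mean([effective_dimensionality(copies) for copies in x])
    ed_d = effective_dimensionality(x.reshape(-1, x.shape[-1]))
    return ed_c - ed_d

# Two inputs whose noisy copies cluster along different axes:
# each cluster has low ED, while the pooled batch has higher ED.
copies = np.array([[[1.0, 0.0], [1.1, 0.0]],
                   [[0.0, 1.0], [0.0, 0.9]]])
print(compression_loss(copies))
```

With this definition, a loss that falls corresponds to tighter per-input clusters, more diverse responses across inputs, or both.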

Paper Structure

This paper contains 16 sections, 13 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Assessing the informativeness of neuronal responses through effective dimensionality (ED). a. Illustration of what ED quantifies for a zero-mean Gaussian distribution. The eigenvalues $\lambda_i$ are computed from the uncentered second moment. ED approaches 1 when variance is concentrated along a single principal direction (left) and increases toward 2 as variance becomes isotropic (right). b. Influence of mean and covariance on ED. In the left and middle panels, $\mu_2=0, \sigma_1^2 = \sigma_2^2 = 1$ and $\mu_2 = \mu_1 = 0, \sigma_2^2 = 5$ are fixed while varying $\mu_1$ and $\sigma_1^2$, respectively. In the right panel, $\mu = 0, \sigma^2 = 1$ are fixed while varying the correlation coefficient $\rho_{12}$. c. Example tuning curves showing neurons selectively responsive to some category-informative features, forming a population code that encodes categorical information. d. ED as a measure of class separability in a two-dimensional response space. Points represent noisy samples from two classes (blue and orange). Within-class responses form clusters with low ED, whereas their mixture (whose uncentered covariance is represented by the dashed gray ellipse) exhibits higher ED, reflecting representational diversity.
  • Figure 2: Network architecture and training pipeline. The first dropout layer generates $N$ noisy variants per input and remains active during inference, while the dropout in the linear classifier is used only for regularization during training. Batch normalization (BN) layers stabilize inputs and contain no trainable parameters. Each convolutional block includes a Fixed Orthonormal Projection (FOP) module that projects its output onto a subspace with a pregenerated random orthonormal basis before computing the dimensionality compression loss $L$. Training proceeds in two phases: (1) Each convolutional block is trained layer-wise for 3 epochs using the proposed loss function $L$. (2) The convolutional blocks are then frozen, and a linear classifier is trained for 60 epochs using cross-entropy loss, where the prediction score for each sample is computed as the mean of squared classifier outputs over the noisy variants. The overall architecture and training pipeline are consistent across all experiments, except for the classifier's input dimensionality, which varies by dataset.
  • Figure 3: Effect of the trade-off factor $\alpha$ on weight optimization. a. Visualization of the first-layer convolutional kernels trained on CIFAR-10 under different values of $\alpha$. Each column shows the top 10 channels ranked by the standard deviation of their weights trained with different $\alpha$. b. Cosine orthogonality score (COS, blue line) and mean standard deviation of first-layer kernels (orange line) as functions of $\alpha$. A higher COS indicates greater diversity among channels' weights. c. Classification accuracy comparison for $\alpha = 0.0$ and $\alpha = 0.5$ (default). Error bars denote one standard deviation across 5 independent training runs.
  • Figure 4: Factors affecting task performance. a. Classification accuracy under different inference strategies. $\mathbb{E}[Y^2]$: proposed method, using the mean squared outputs (energy) based on generated noisy samples as prediction score; $\mathbb{E}[Y]$: uses the mean of outputs as prediction score. Direct forward: standard inference without noise, using raw inputs. b. Accuracy under different training schemes. Unsup: proposed method, where $\text{ED}_c$ is computed at the instance level based on generated noisy samples. Sup+sampling: generated noisy samples are further grouped by class labels before computing $\text{ED}_c$. Sup: computes $\text{ED}_c$ directly on labeled data without the need to generate noisy samples. c. Accuracy under different projection strategies. Graded: block outputs are projected with gradually decreasing dimensions (30-20-10 for MNIST and CIFAR-10; 90-150-100 for CIFAR-100). Fixed: all blocks projected to a constant dimension equal to the number of classes. Random: projected to a randomly selected dimension per block. None: no projection.
  • Figure 5: Layerwise analysis of representations after training. a. Effective dimensionality (ED) of block outputs projected into a lower-dimensional space. $\text{ED}_d$ and $\text{ED}_c$ are shown in blue and orange, respectively, and the shaded region denotes one standard deviation of $\text{ED}_c$ across classes. The horizontal dashed line marks the projection dimensionality, and the black dashed line shows the compression ratio $\text{ED}_d / \text{ED}_c$. b. Linear separability of each block’s representation, measured by training a linear classifier on the output of each block. c. Information decomposition of classifier outputs from each block, assuming a Gaussian mixture model. We report total mutual information (tot), linearly decodable information (lin), and second-order interaction terms (cor), where tot = lin + cor. Results shown are from a randomly selected model trained on CIFAR-10.
  • ...and 4 more figures
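
The inference strategies compared in Figure 4a can be sketched as follows, under stated assumptions. The captions say that a dropout layer generates $N$ noisy variants per input and that the prediction score is the mean of squared classifier outputs over those variants. Everything else here is hypothetical: the function names, the inverted-dropout scaling, the default values of `n_variants` and `drop_prob`, and the plain linear classifier `weights` are placeholders rather than the authors' implementation.

```python
import numpy as np

def noisy_variants(x, n_variants, drop_prob, rng):
    """Generate dropout-perturbed copies of x, shape (n_variants, *x.shape).

    Inverted-dropout scaling (division by the keep probability) is an assumption.
    """
    keep = rng.random((n_variants,) + x.shape) >= drop_prob
    return x * keep / (1.0 - drop_prob)

def energy_scores(outputs):
    """E[Y^2]: per-class mean of squared outputs over the noisy variants."""
    return np.mean(np.square(outputs), axis=0)

def mean_scores(outputs):
    """E[Y]: per-class mean of raw outputs over the noisy variants."""
    return np.mean(outputs, axis=0)

def predict_energy(x, weights, n_variants=8, drop_prob=0.2, seed=0):
    """Classify x with a hypothetical linear readout using the energy score."""
    rng = np.random.default_rng(seed)
    variants = noisy_variants(x, n_variants, drop_prob, rng)  # (N, d)
    outputs = variants @ weights                              # (N, n_classes)
    return int(np.argmax(energy_scores(outputs)))

# Outputs that fluctuate in sign carry energy that the plain mean cancels out.
outs = np.array([[2.0, 1.0],
                 [-2.0, 1.0]])
print(np.argmax(energy_scores(outs)), np.argmax(mean_scores(outs)))
```

In the toy `outs` array the first class has mean 0 but mean square 4, so the two scoring rules pick different classes. This is the kind of second-order signal that an energy-based readout can use and a mean-based readout cannot.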