A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks

Tommaso Salvatori; Yuhang Song; Yordan Yordanov; Beren Millidge; Zhenghua Xu; Lei Sha; Cornelius Emde; Rafal Bogacz; Thomas Lukasiewicz

TL;DR

The paper tackles slow, unstable training in predictive coding (PC) networks and introduces incremental predictive coding (iPC), which performs inference and updates the synaptic weights in parallel at every time step. Grounded in a variational free energy framework and incremental expectation-maximization (EM), iPC provides convergence guarantees and removes the need for external control signals. Empirically, iPC consistently outperforms the original PC on image classification benchmarks and matches or approaches backpropagation (BP) performance on larger models, while offering improved calibration under distribution shift and better parameter efficiency. The approach extends to language modeling tasks, where it reaches perplexities comparable to BP and is substantially more stable than PC, highlighting its practical potential for neuroscience-inspired learning on large-scale tasks.

Abstract

Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience. Training such models, however, is quite inefficient and unstable. In this work, we show that simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one, and has theoretical guarantees in terms of convergence. The proposed algorithm, which we call incremental predictive coding (iPC), is also more biologically plausible than the original one, as it is fully automatic. In an extensive set of experiments, we show that iPC consistently performs better than the original formulation on a large number of benchmarks for image classification, as well as for the training of both conditional and masked language models, in terms of test accuracy, efficiency, and convergence with respect to a large set of hyperparameters.

Paper Structure (26 sections, 1 theorem, 15 equations, 9 figures, 4 tables, 3 algorithms)

Introduction
Preliminaries
Predictive Coding
Incremental Predictive Coding
Efficiency
Classification Experiments
Robustness and Calibration
Language Model Experiments
Related Works
Discussion
Acknowledgements
A Discussion on Biological Plausibility
Weight Transport
Pseudocodes of Z-IL and PC
On the efficiency of PC, BP, and iPC
...and 11 more sections

Key Result

Theorem 3.1

Let $M$ and $M'$ be two equivalent networks with $L$ layers trained on the same dataset. Let $M$ be trained using BP, and $M'$ be trained using iPC. Then, the time complexity needed to perform one full update of the weights is $\mathcal{O}(1)$ for iPC and $\mathcal{O}(L)$ for BP.
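The gap between the two complexities can be made concrete with a toy count of weight updates per backward SMM, using the 3-layer example described in Figure 5. This is only a sketch, not the proof of Theorem 3.1. It assumes that the error reaches one additional layer per backward SMM, that the output-layer error is available before any backward SMM, that iPC never clears the error, and that BP clears it after every full update. The function name `count_updates` and its arguments are hypothetical.

```python
def count_updates(num_layers: int, backward_smms: int) -> dict:
    """Toy count of weight updates within a budget of backward SMMs.

    Assumptions (illustrative, not taken from the paper's proof):
    - the error reaches one more layer per backward SMM;
    - the output-layer error exists before any backward SMM;
    - forward-pass SMMs are ignored, as in Figure 5.
    """
    if num_layers < 2:
        raise ValueError("num_layers must be at least 2")

    # iPC: update after every SMM (and once before the first one),
    # using whatever layers currently hold an error. Errors persist.
    ipc_full = 0
    ipc_partial = 0
    for t in range(backward_smms + 1):
        layers_with_error = min(t + 1, num_layers)
        if layers_with_error == num_layers:
            ipc_full += 1
        else:
            ipc_partial += 1

    # BP: the error must reach every layer before the update, and it is
    # cleared afterwards, so each full update costs num_layers - 1 SMMs.
    bp_full = backward_smms // (num_layers - 1)

    return {"ipc_full": ipc_full, "ipc_partial": ipc_partial, "bp_full": bp_full}


if __name__ == "__main__":
    # 3-layer network, first 6 backward SMMs (the setting of Figure 5).
    print(count_updates(num_layers=3, backward_smms=6))
```

Under these assumptions the 3-layer case yields 5 full and 2 partial updates for iPC and 3 full updates for BP, matching the counts in the Figure 5 caption. Increasing `num_layers` lengthens the wait between BP updates linearly, while iPC still performs one update per SMM after the initial warm-up.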

Figures (9)

Figure 1: (a) An example of a hierarchical Gaussian generative model with three layers. (b) Comparison of the temporal training dynamics of PC, Z-IL, and iPC, where Z-IL is a variant of PC that is equivalent to BP, originally introduced in Song2020. We assume that the networks are trained on a supervised learning dataset for a period of time $T$. Here, $t$ is the time axis during inference, which always starts at $t=0$. The squares represent nodes in one layer, and pink rounded rectangles indicate when the connection weights are modified: PC (1st row) first conducts inference on the hidden layers, according to Eq. \ref{eq:x}, until convergence, and then updates the weights via Eq. \ref{eq:theta}. Z-IL (2nd row) only updates the weights at specific inference moments, depending on which layer the weights belong to. Finally, iPC (3rd row) updates the weights at every time step $t$, while performing inference in parallel.
Figure 2: Left and centre: Decrease of the energy of generative models as a function of the number of iterations performed from the beginning of the training process. Right: Training loss of different classifiers in a full-batch training regime as a function of the number of non-parallel matrix multiplications performed from the beginning of the training process.
Figure 3: Left: Robustness of BP and iPC under distribution shift (AlexNet on CIFAR10 under five different intensities of the corruptions rotation, Gaussian blur, Gaussian noise, hue, brightness, and contrast). iPC maintains model calibration significantly better than BP under distribution shift. Right: Dev perplexity during training of the best performing masked language models.
Figure 4: Standard and dendritic neural implementation of predictive coding. The dendritic implementation makes use of interneurons $i_l = W_l x_l$ (according to the notation used in the figure). Both implementations have the same equations for all the updates, and are thus equivalent; however, dendrites allow a neural implementation that does not take error nodes into account, improving the biological plausibility of the model. Figure taken and adapted from whittington2019theories.
Figure 5: Graphical illustration of the efficiency over backward SMMs of BP and iPC on a $3$-layer network. iPC never clears the error (red neurons), while BP clears it after every update. This allows iPC to perform $5$ full and $2$ partial updates of the weights in the first $6$ SMMs. In the same time frame, BP only performs $3$ full updates. Note that the SMMs of forward passes are excluded for simplicity, w.l.o.g., as the insight from this example generalizes to the SMMs of the forward pass.
...and 4 more figures
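The difference between the PC and iPC schedules described in the Figure 1(b) caption can be sketched in a few lines of NumPy. The update equations themselves are not reproduced in this summary, so the sketch assumes a standard squared-error energy $F = \frac{1}{2}\sum_l \|x_l - W_l f(x_{l-1})\|^2$ with $f = \tanh$, plain gradient steps on the hidden states and on the weights, and a fixed number of inference steps `T` standing in for "until convergence". All function names, step sizes, and layer sizes are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def f(x):
    return np.tanh(x)

def df(x):
    return 1.0 - np.tanh(x) ** 2

def errors(xs, Ws):
    # Prediction error at layer l+1: e = x_{l+1} - W_l f(x_l)
    return [xs[l + 1] - Ws[l] @ f(xs[l]) for l in range(len(Ws))]

def energy(xs, Ws):
    return 0.5 * sum(float(e @ e) for e in errors(xs, Ws))

def init_states(x_in, y, Ws):
    # Forward pass for the hidden layers; input and output are clamped.
    xs = [x_in]
    for W in Ws[:-1]:
        xs.append(W @ f(xs[-1]))
    xs.append(y)
    return xs

def inference_step(xs, Ws, gamma):
    # One gradient step on the energy w.r.t. the hidden states only.
    es = errors(xs, Ws)
    new_xs = list(xs)
    for l in range(1, len(xs) - 1):
        grad = es[l - 1] - df(xs[l]) * (Ws[l].T @ es[l])
        new_xs[l] = xs[l] - gamma * grad
    return new_xs

def weight_step(xs, Ws, alpha):
    # One gradient step on the energy w.r.t. every weight matrix.
    es = errors(xs, Ws)
    return [W + alpha * np.outer(e, f(x)) for W, e, x in zip(Ws, es, xs)]

def train_on_sample(x_in, y, Ws, T=8, gamma=0.1, alpha=0.01, incremental=True):
    xs = init_states(x_in, y, Ws)
    for _ in range(T):
        new_xs = inference_step(xs, Ws, gamma)
        if incremental:
            # iPC schedule: weights change at every time step t,
            # computed from the same states as the inference step.
            Ws = weight_step(xs, Ws, alpha)
        xs = new_xs
    if not incremental:
        # PC schedule: a single weight update after inference has run.
        Ws = weight_step(xs, Ws, alpha)
    return Ws

def make_demo(seed=0, sizes=(4, 5, 3)):
    # Small random network and a one-hot target, purely for illustration.
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=0.3, size=(n_out, n_in))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    x_in = rng.normal(size=sizes[0])
    y = np.eye(sizes[-1])[0]
    return x_in, y, Ws

if __name__ == "__main__":
    x_in, y, Ws = make_demo()
    pc_Ws = train_on_sample(x_in, y, Ws, incremental=False)
    ipc_Ws = train_on_sample(x_in, y, Ws, incremental=True)
    print("PC  energy:", energy(init_states(x_in, y, pc_Ws), pc_Ws))
    print("iPC energy:", energy(init_states(x_in, y, ipc_Ws), ipc_Ws))
```

The only difference between the two schedules is where `weight_step` is called: inside the inference loop for iPC, after it for PC. The Z-IL schedule, which updates each layer's weights at a layer-specific inference step, is not covered by this sketch.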

Theorems & Definitions (1)

Theorem 3.1
