On the Infinite Width and Depth Limits of Predictive Coding Networks

Francesco Innocenti, El Mehdi Achour, Rafal Bogacz

TL;DR

The paper investigates the infinite width $N$ and depth $L$ limits of Predictive Coding Networks (PCNs) to determine when PC can replicate Backpropagation (BP). It proves that, for linear networks, the set of width-stable, feature-learning parameterisations for PC matches BP, and under these parameterisations PC gradients converge to BP gradients as $N$ becomes large while $N\gg L$. It extends the analysis to depth for linear residual networks, showing BP equivalence under width- and depth-stable parameterisations with $N \gg L$ and $\alpha=1/2$, and confirms these findings with nonlinear experiments on CIFAR-10 and Fashion-MNIST, illustrating practical applicability. The work thereby unifies previous theoretical and empirical results and provides a principled path to scaling PCNs, with implications for biologically plausible credit assignment and energy-efficient AI.

Abstract

Predictive coding (PC) is a biologically plausible alternative to standard backpropagation (BP) that minimises an energy function with respect to network activities before updating weights. Recent work has improved the training stability of deep PC networks (PCNs) by leveraging some BP-inspired reparameterisations. However, the full scalability and theoretical basis of these approaches remain unclear. To address this, we study the infinite width and depth limits of PCNs. For linear residual networks, we show that the set of width- and depth-stable feature-learning parameterisations for PC is exactly the same as for BP. Moreover, under any of these parameterisations, the PC energy with equilibrated activities converges to the BP loss in a regime where the model width is much larger than the depth, resulting in PC computing the same gradients as BP. Experiments show that these results hold in practice for deep nonlinear networks, as long as an activity equilibrium seems to be reached. Overall, this work unifies various previous theoretical and empirical results and has potentially important implications for the scaling of PCNs.
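
The claim that the equilibrated PC energy approaches the BP loss at large width can be checked numerically on a small linear MLP. The sketch below is an illustration under assumptions, not the authors' code: the function names, the unit-variance squared-error energy, the scalar output, and in particular the $1/N$ output-layer scaling are hypothetical stand-ins for a feature-learning parameterisation and are not taken from the paper's parameterisation table. It runs gradient descent on the hidden activities from a feedforward initialisation, then compares the resulting energy with the MSE loss and with a closed-form rescaling that holds for this particular energy.

```python
import numpy as np

def make_linear_mlp(rng, d_in, width, depth):
    """Linear MLP weights with scalar output (scalings are assumptions)."""
    Ws = [rng.standard_normal((width, d_in)) / np.sqrt(d_in)]
    Ws += [rng.standard_normal((width, width)) / np.sqrt(width)
           for _ in range(depth - 2)]
    Ws.append(rng.standard_normal((1, width)) / width)  # assumed 1/N output scaling
    return Ws

def forward(Ws, x):
    """Feedforward pass; returns [x, z_1, ..., z_L]."""
    zs = [x]
    for W in Ws:
        zs.append(W @ zs[-1])
    return zs

def pc_energy(Ws, zs):
    """Sum of squared prediction errors over all layers."""
    return 0.5 * sum(np.sum((zs[l + 1] - W @ zs[l]) ** 2)
                     for l, W in enumerate(Ws))

def equilibrate(Ws, x, y, beta=0.1, steps=1000):
    """Gradient descent on hidden activities with input and output clamped."""
    zs = forward(Ws, x)
    zs[-1] = y  # clamp the output layer to the target
    n_layers = len(Ws)
    for _ in range(steps):
        # eps[l] is the prediction error at layer l + 1
        eps = [zs[l + 1] - Ws[l] @ zs[l] for l in range(n_layers)]
        for l in range(1, n_layers):  # hidden layers only
            zs[l] = zs[l] - beta * (eps[l - 1] - Ws[l].T @ eps[l])
    return zs

def rescaling(Ws):
    """1 + sum over l >= 2 of ||W_L ... W_l||_F^2 (scalar-output case)."""
    s, prod = 1.0, np.eye(Ws[-1].shape[0])
    for W in reversed(Ws[1:]):
        prod = prod @ W
        s += np.sum(prod ** 2)
    return s

rng = np.random.default_rng(0)
x, y = rng.standard_normal(8), np.array([1.0])
for width in (16, 256):
    Ws = make_linear_mlp(rng, 8, width, depth=3)
    loss = 0.5 * np.sum((y - forward(Ws, x)[-1]) ** 2)
    energy = pc_energy(Ws, equilibrate(Ws, x, y))
    # energy / loss should match 1 / rescaling and move towards 1 as width grows
    print(width, energy / loss, 1.0 / rescaling(Ws))
```

Starting inference from the feedforward pass makes the initial energy equal to the MSE loss, so any gap after inference is due to the rescaling alone. How the `rescaling` quantity here relates to the paper's $s(\boldsymbol{\theta})$ is an assumption.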

Paper Structure (58 sections, 4 theorems, 61 equations, 27 figures, 2 tables)

Key Result

Theorem 1

Consider the $(a_\ell, b_\ell, c, d)$ parameterisation of linear MLPs (Eqs. \ref{eq:mlp-output-function}-\ref{eq:mlp-first-activation}), and assume PCNs with converged activities that therefore learn on the equilibrated energy (Eq. \ref{eq:pc-equilib-energy}). Then, there exists a set of one-dimensional parameterisat

Figures (27)

  • Figure 1: Under width- and depth-stable feature-learning parameterisations of linear residual networks, PC converges to BP when the model width is much larger than the depth, $N \gg L$. We trained linear residual networks on CIFAR-10 with the mean-field parameterisation (as defined in Table \ref{tab:params-summary}) and depth scaling exponent $\alpha=1/2$ (§\ref{sec:pc-depth-params}). Plotted are the cosine similarities between the equilibrated energy (Eq. \ref{eq:pc-equilib-energy}) gradients (PC) and the MSE loss (Eq. \ref{eq:mse-loss}) gradients (BP) at different training steps $t$.
  • Figure 2: Under width-stable and feature-learning parameterisations of linear MLPs, PC converges to BP at large width. We trained deep linear MLPs ($L=5$) of varying widths $N$ with full-batch GD on a toy task with binary labels. All models used the mean-field parameterisation as defined in Table \ref{tab:params-summary}. For comparative results with the standard parameterisation (SP), see Figure \ref{fig:sp-toy-linear-net}. (Left) As predicted by Eq. \ref{eq:rescaling-width-order}, the equilibrated energy rescaling $s(\boldsymbol{\theta})$ approaches one as $N \rightarrow \infty$. (Middle) As a result, the equilibrated energy $\mathcal{F}^*(\boldsymbol{\theta})$ (Eq. \ref{eq:pc-equilib-energy}) converges to the MSE loss $\mathcal{L}(\boldsymbol{\theta})$ (Eq. \ref{eq:mse-loss}). (Right) PC therefore effectively computes the same gradients as BP. The theoretical loss was calculated using dynamical mean field theory (see §\ref{exp-details} for more details). For additional results, including on CIFAR-10, see Figures \ref{fig:mupc-toy-linear-net-extra} & \ref{fig:mean-field-linear-mlp-16-cifar}.
  • Figure 3: Empirical verification of Eq. \ref{eq:resnet-rescaling-width-depth-order}. For linear residual networks trained on CIFAR-10, we plot the empirical equilibrated energy rescaling $s(\boldsymbol{\theta})$ (Eq. \ref{eq:resnet-equilib-energy-rescaling}) minus one as a function of the width $N$ and depth $L$, against the $L/N$ theoretical prediction (Eq. \ref{eq:resnet-rescaling-width-depth-order}). Note that the same scaling applies to MLPs with infinite depth (Figure \ref{fig:mean-field-rescaling-mlp-cifar}), but their forward pass is depth-unstable as discussed in §\ref{sec:learning-regimes}.
  • Figure 4: PC also converges to BP on nonlinear networks that are much wider than they are deep, under stable and feature-learning parameterisations. With a setup similar to Figure \ref{fig:mean-field-linear-resnet-width-vs-depth-cifar}, we fix the width at $N=2048$ and train nonlinear residual networks of depths $L \in \{2, 16\}$ under the mean-field parameterisation (see Table \ref{tab:params-summary}). We tested Tanh (Left) and ReLU (Right) as activation functions. Plotted are the cosine similarities between the MSE loss (Eq. \ref{eq:mse-loss}) gradients (BP) and the energy (Eq. \ref{eq:pc-equilib-energy}) gradients (PC) at the last step of activity (GD) optimisation (Eq. \ref{eq:pc-infer}), for different activity step sizes $\beta$ (§\ref{sec:pcns}) and training steps $t$. See §\ref{exp-details} for more details and Figures \ref{fig:mean-field-toy-resnet-tanh} & \ref{fig:mean-field-toy-resnet-relu} for additional results.
  • Figure A.1: Under stable and feature-learning parameterisations, PC converges to BP for model width much larger than depth even on nonlinear networks. With a similar setup to Figure \ref{fig:mupc-toy-linear-net}, we plot results for nonlinear residual networks ($L=5$) with Tanh as activation function. As in the linear case, we see that the PC energy at numerical convergence of the activities $\mathcal{F}(\mathbf{z}_{T_{\text{max}}})$ converges to the BP MSE loss (Eq. \ref{eq:mse-loss}) for sufficiently large width (Left), resulting in PC computing the same gradients as BP (Right). See the next figure for results with ReLU and Figures \ref{fig:mean-field-toy-mlp-tanh}-\ref{fig:mean-field-toy-mlp-relu} for similar results with MLPs.
  • ...and 22 more figures
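
Several of the figures above report cosine similarities between PC and BP weight gradients. A minimal sketch of how such a quantity could be computed is given below for the smallest case, a two-layer linear network with scalar output. It is not the authors' experimental code: the function names, the unit-variance squared-error energy, the activity step size, and the $1/N$ scaling of the output weights are all assumptions made for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flattened gradient vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pc_and_bp_grads(W1, w2, x, y, beta=0.5, steps=200):
    """Flattened weight gradients of the PC energy and of the MSE loss."""
    h = W1 @ x                       # feedforward hidden activity
    r = y - w2 @ h                   # output error seen by BP
    # BP: gradients of 0.5 * r**2 with respect to W1 and w2
    bp = np.concatenate([(-r * np.outer(w2, x)).ravel(), -r * h])
    # PC inference: gradient descent on the hidden activity z,
    # starting from the feedforward pass, with input and target clamped
    z = h.copy()
    for _ in range(steps):
        e1 = z - h                   # hidden-layer prediction error
        e2 = y - w2 @ z              # output-layer prediction error
        z -= beta * (e1 - w2 * e2)
    e1, e2 = z - h, y - w2 @ z
    # PC: gradients of 0.5*||e1||^2 + 0.5*e2^2 at the relaxed activity
    pc = np.concatenate([(-np.outer(e1, x)).ravel(), -e2 * z])
    return pc, bp

rng = np.random.default_rng(0)
x, y = rng.standard_normal(8), 1.0
for width in (4, 64, 1024):
    W1 = rng.standard_normal((width, 8)) / np.sqrt(8)
    w2 = rng.standard_normal(width) / width  # assumed 1/N output scaling
    pc, bp = pc_and_bp_grads(W1, w2, x, y)
    print(width, cosine(pc, bp))
```

In this toy setting the first-layer gradients of PC and BP are exactly parallel, so any misalignment comes from the output-layer gradient, which PC evaluates at the relaxed activity rather than at the feedforward one.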

Theorems & Definitions (4)

  • Theorem 1: Width-stable and feature-learning parameterisations for linear PCNs.
  • Corollary 1: PC convergence to BP on wide linear MLPs.
  • Theorem 2
  • Corollary 2: PC convergence to BP on deep and wide linear residual networks.
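
The flavour of Corollary 1 can be seen in a worked two-layer example. This is an illustrative derivation under assumptions, not the paper's statement or proof: it takes a linear network with scalar output, hidden weights $W_1$, output weights $w_2$, unit-variance squared errors, and introduces the hypothetical symbols $\delta$ (deviation of the hidden activity from its feedforward value) and $r$ (feedforward output error).

$$
\mathcal{F}(z) = \tfrac{1}{2}\lVert z - W_1 x\rVert^2 + \tfrac{1}{2}\,(y - w_2^\top z)^2,
\qquad z = W_1 x + \delta,
\qquad r = y - w_2^\top W_1 x .
$$

Setting $\partial \mathcal{F} / \partial \delta = \delta - w_2\,(r - w_2^\top \delta) = 0$ gives $\delta^* = w_2\, r / (1 + \lVert w_2\rVert^2)$, and substituting back yields

$$
\mathcal{F}^* = \frac{1}{1 + \lVert w_2\rVert^2}\cdot \tfrac{1}{2}\, r^2 = \frac{\mathcal{L}}{1 + \lVert w_2\rVert^2},
$$

where $\mathcal{L} = \tfrac{1}{2} r^2$ is the MSE loss. If the $N$ entries of $w_2$ are of order $1/N$ (an assumed output scaling), then $\lVert w_2\rVert^2$ is of order $1/N$, so $\mathcal{F}^* \to \mathcal{L}$ as $N \to \infty$.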