Only Strict Saddles in the Energy Landscape of Predictive Coding Networks?

Francesco Innocenti; El Mehdi Achour; Ryan Singh; Christopher L. Buckley

TL;DR

This work shows that PC inference makes the loss landscape of feedforward networks more benign and robust to vanishing gradients, while also highlighting the fundamental challenge of scaling PC to very deep models.

Abstract

Predictive coding (PC) is an energy-based learning algorithm that performs iterative inference over network activities before updating weights. Recent work suggests that PC can converge in fewer learning steps than backpropagation thanks to its inference procedure. However, these advantages are not always observed, and the impact of PC inference on learning is not theoretically well understood. Here, we study the geometry of the PC energy landscape at the inference equilibrium of the network activities. For deep linear networks, we first show that the equilibrated energy is simply a rescaled mean squared error loss with a weight-dependent rescaling. We then prove that many highly degenerate (non-strict) saddles of the loss including the origin become much easier to escape (strict) in the equilibrated energy. Our theory is validated by experiments on both linear and non-linear networks. Based on these and other results, we conjecture that all the saddles of the equilibrated energy are strict. Overall, this work suggests that PC inference makes the loss landscape more benign and robust to vanishing gradients, while also highlighting the fundamental challenge of scaling PC to deeper models.

Paper Structure (33 sections, 3 theorems, 52 equations, 12 figures)

Introduction
Summary of contributions
Preliminaries
Notation.
Deep Linear Networks (DLNs)
Predictive coding (PC)
Theoretical results
Equilibrated energy as rescaled MSE
Analysis of the origin saddle ($\boldsymbol{\theta} = \mathbf{0}$)
Analysis of other saddles
Experiments
Discussion
Implications
Limitations
Appendix
...and 18 more sections

Key Result

Theorem 1

For any DLN parameterised by $\boldsymbol{\theta} \coloneq (\mathbf{W}_1, \dots, \mathbf{W}_L)$ with input and output $(\mathbf{x}_i, \mathbf{y}_i)$, the PC energy (Eq. \ref{eq2}) at the exact inference equilibrium $\partial \mathcal{F}/\partial \mathbf{z} = \mathbf{0}$ is the following rescaled MSE loss where the rescaling is $\mathbf{S} = \mathbf{I}_{d_y} + \sum_{\ell=2}^L (\mathbf{W}_{L:\ell})(\math
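
Theorem 1 can be checked numerically in the simplest setting, a chain of scalar weights. The sketch below makes two assumptions that are not stated in this excerpt: that the PC energy is the usual sum of squared layer-wise prediction errors, $\mathcal{F} = \frac{1}{2}\sum_\ell (z_\ell - w_\ell z_{\ell-1})^2$ with $z_0 = x$ and $z_L = y$, and that the truncated rescaling continues as $\mathbf{S} = \mathbf{I}_{d_y} + \sum_{\ell=2}^L (\mathbf{W}_{L:\ell})(\mathbf{W}_{L:\ell})^\top$, which for scalars is $S = 1 + \sum_{\ell=2}^L (w_L \cdots w_\ell)^2$. The function names, weights, data point and step size are illustrative choices.

```python
import random

def pc_energy(ws, zs, x, y):
    """Assumed PC energy: 0.5 * sum_l (z_l - w_l * z_{l-1})^2, z_0 = x, z_L = y."""
    acts = [x] + list(zs) + [y]
    return 0.5 * sum((acts[l + 1] - w * acts[l]) ** 2 for l, w in enumerate(ws))

def infer(ws, x, y, lr=0.1, steps=2000):
    """PC inference: gradient descent on the hidden activities z_1..z_{L-1}."""
    L = len(ws)
    zs = [0.0] * (L - 1)
    for _ in range(steps):
        acts = [x] + zs + [y]
        # eps[l] is the prediction error of layer l + 1 (0-based list index).
        eps = [acts[l + 1] - ws[l] * acts[l] for l in range(L)]
        # dF/dz = own prediction error minus the error it causes one layer up.
        zs = [z - lr * (eps[l] - ws[l + 1] * eps[l + 1]) for l, z in enumerate(zs)]
    return zs

def equilibrated_energy(ws, x, y):
    """Assumed closed form: 0.5 * r^2 / S with S = 1 + sum_{l=2}^L (w_L...w_l)^2."""
    prod, S = 1.0, 1.0
    for w in reversed(ws[1:]):  # w_L, w_{L-1}, ..., w_2
        prod *= w
        S += prod ** 2
    r = y - prod * ws[0] * x    # residual of the end-to-end linear map
    return 0.5 * r * r / S

random.seed(0)
ws = [random.uniform(-1.0, 1.0) for _ in range(4)]  # 3 hidden units in a chain
x, y = 1.0, -1.0
zs = infer(ws, x, y)
numerical = pc_energy(ws, zs, x, y)
theoretical = equilibrated_energy(ws, x, y)
print(numerical, theoretical)
```

Under these assumptions the two printed values should agree to numerical precision, mirroring on a toy scale the comparison that the Figure 1 caption describes for wider networks.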

Figures (12)

Figure 1: Empirical verification of the theoretical equilibrated energy of deep linear networks (Theorem 1). For different datasets, we plot the energy (Eq. \ref{eq2}) at the numerical inference equilibrium $\mathcal{F}|_{\partial \mathcal{F}/\partial \mathbf{z}\approx0}$ for DLNs with different numbers of hidden layers $H \in \{2, 5, 10\}$ (see §\ref{exp-details} for more details), observing an excellent match with the theoretical prediction (Eq. \ref{eq5}).
Figure 2: Toy examples illustrating the result (Theorem 2) that the saddle at the origin of the equilibrated energy is strict independently of network depth. We plot the MSE loss $\mathcal{L}(\boldsymbol{\theta})$ (top) and the equilibrated energy landscape $\mathcal{F}^*(\boldsymbol{\theta})$ (middle) around the origin for 3 linear networks trained with SGD on a toy problem (see §\ref{exp-details} for details). We also show the training losses for a representative run with initialisation close to the origin (bottom). For the one-dimensional networks, we visualise the landscape around the origin as well as the SGD updates. For the wide network, we project the landscape onto the maximum and minimum eigenvectors of the Hessian, following \cite{bottcher2024visualizing}. Note that in this case the loss is flat because the Hessian at the origin is zero for $H > 1$ (Eq. \ref{eq6}).
Figure 3: Empirical verification of the Hessian at the origin of the equilibrated energy for DLNs tested on toy data. We show the Hessian and its eigenspectrum at the origin of the MSE loss (top) and equilibrated energy (middle) for DLNs with Gaussian target $\mathbf{y}=-\mathbf{x}$ where $\mathbf{x} \sim \mathcal{N}(1, 0.1)$ (see §\ref{exp-details} for details). Note that purple bars show overlapping loss and energy Hessian eigendensity. In the right panel, we vary one of the output dimensions to be $y_2 = x_2$. We confirm the strictness of the origin saddle in the equilibrated energy and observe an excellent numerical validation of our theoretical Hessian (Eq. \ref{eq8}). Figure \ref{supp-fig-2} shows the same results for one-dimensional networks, and Figure 4 shows similar results for more realistic datasets.
Figure 4: Empirical verification of the Hessian eigenspectrum at the origin of the equilibrated energy for DLNs tested on more realistic datasets. This shows similar results to Figure 3 for the more realistic datasets MNIST and MNIST-1D \cite{greydanus2020scaling} (see §\ref{exp-details} for details). We again find a perfect match between theory and experiment for DLNs with different numbers of hidden layers $H \in \{1, 2, 4\}$, confirming the strictness of the origin saddle of the equilibrated energy.
Figure 5: PC escapes the origin saddle much faster than BP with SGD on non-linear networks. We plot the training loss for a representative run of BP and PC for linear and non-linear networks trained on standard image classification tasks (see §\ref{exp-details} for details). All networks were initialised close to the origin with scale $\sigma = 5 \times 10^{-3}$ and trained with SGD with learning rate $\eta = 10^{-3}$. The networks trained on MNIST and Fashion-MNIST had 5 fully connected layers, while those trained on CIFAR-10 had a convolutional architecture. Figure \ref{supp-fig-5} shows the corresponding weight gradient norms during training. Results were consistent across different random seeds.
...and 7 more figures
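
The escape behaviour described in the captions of Figures 2 and 5 can be illustrated with a toy comparison. The sketch below is not the authors' experiment: it runs plain gradient descent on a one-dimensional network with three weights and a single data point, once on the MSE loss (standing in for BP) and once on an assumed scalar form of the equilibrated energy, $\frac{1}{2} r^2 / S$ with $S = 1 + w_3^2 + (w_3 w_2)^2$ (standing in for PC with exact inference). The initialisation scale, learning rate, step count and finite-difference gradients are all illustrative assumptions.

```python
def mse_loss(w, x=1.0, y=1.0):
    """MSE loss of the scalar linear chain w3 * w2 * w1 * x on one data point."""
    w1, w2, w3 = w
    return 0.5 * (y - w3 * w2 * w1 * x) ** 2

def equilibrated_energy(w, x=1.0, y=1.0):
    """Assumed equilibrated energy: the MSE loss divided by the rescaling S."""
    w1, w2, w3 = w
    S = 1.0 + w3 ** 2 + (w3 * w2) ** 2
    return mse_loss(w, x, y) / S

def grad(f, w, h=1e-6):
    """Central finite-difference gradient (chosen for brevity, not accuracy)."""
    g = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        g.append((f(up) - f(down)) / (2 * h))
    return g

def descend(f, w0, lr=0.1, steps=1000):
    """Full-batch gradient descent on the objective f."""
    w = list(w0)
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(f, w))]
    return w

w0 = [0.005, 0.005, 0.005]                # initialisation close to the origin
w_bp = descend(mse_loss, w0)              # descent on the loss
w_pc = descend(equilibrated_energy, w0)   # descent on the equilibrated energy
loss_bp = mse_loss(w_bp)
loss_pc = mse_loss(w_pc)
print(loss_bp, loss_pc)
```

In this toy setting the loss gradient near the origin scales with the product of two small weights, so descent on the loss barely moves within the step budget, whereas the energy has negative curvature along $w_3$ and descent on it leaves the saddle quickly. Both runs are evaluated on the same MSE loss so the comparison is like for like.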

Theorems & Definitions (4)

Definition 1
Theorem 1: Equilibrated energy of DLNs
Theorem 2: Strictness of origin saddle of the equilibrated energy
Theorem 3: Strictness of zero-rank saddles of the equilibrated energy
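
The contrast behind Theorem 2 can be seen in a small case by computing both Hessians at the origin. The sketch below uses a one-dimensional network with three weights ($H = 2$ hidden layers) and a single data point $(x, y) = (1, -1)$, echoing the toy target $\mathbf{y} = -\mathbf{x}$ of Figure 3. It assumes the scalar form of the equilibrated energy, $\frac{1}{2} r^2 / S$ with $S = 1 + w_3^2 + (w_3 w_2)^2$, and approximates second derivatives by central finite differences. All names and constants are illustrative.

```python
def mse_loss(w, x=1.0, y=-1.0):
    """MSE loss of the scalar linear chain w3 * w2 * w1 * x on one data point."""
    w1, w2, w3 = w
    return 0.5 * (y - w3 * w2 * w1 * x) ** 2

def equilibrated_energy(w, x=1.0, y=-1.0):
    """Assumed equilibrated energy: the MSE loss divided by the rescaling S."""
    w1, w2, w3 = w
    S = 1.0 + w3 ** 2 + (w3 * w2) ** 2
    return mse_loss(w, x, y) / S

def hessian(f, w, h=1e-3):
    """Central finite-difference Hessian of f at the point w."""
    n = len(w)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                v = list(w)
                v[i] += si * h
                v[j] += sj * h
                return f(v)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H

origin = [0.0, 0.0, 0.0]
H_loss = hessian(mse_loss, origin)
H_energy = hessian(equilibrated_energy, origin)
for row in H_loss:
    print(row)
for row in H_energy:
    print(row)
```

Here the loss Hessian vanishes at the origin, so the saddle is non-strict, while the energy Hessian has a single negative entry of about $-y^2$ on the diagonal, in the direction of the last weight $w_3$. That negative curvature is what makes the origin a strict saddle of the equilibrated energy in this example.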
