Tight Stability, Convergence, and Robustness Bounds for Predictive Coding Networks

Ankur Mali, Tommaso Salvatori, Alexander Ororbia

TL;DR

This work rigorously analyzes the stability, robustness, and convergence of predictive coding (PC) through the lens of dynamical systems theory. It shows that PC is Lyapunov stable under mild assumptions on its loss and residual energy functions, which implies intrinsic robustness to small random perturbations due to its well-defined energy-minimizing dynamics.

Abstract

Energy-based learning algorithms, such as predictive coding (PC), have garnered significant attention in the machine learning community due to their theoretical properties, such as local operations and biologically plausible mechanisms for error correction. In this work, we rigorously analyze the stability, robustness, and convergence of PC through the lens of dynamical systems theory. We show that, first, PC is Lyapunov stable under mild assumptions on its loss and residual energy functions, which implies intrinsic robustness to small random perturbations due to its well-defined energy-minimizing dynamics. Second, we formally establish that the PC updates approximate quasi-Newton methods by incorporating higher-order curvature information, which makes them more stable and able to converge with fewer iterations compared to models trained via backpropagation (BP). Furthermore, using this dynamical framework, we provide new theoretical bounds on the similarity between PC and other algorithms, i.e., BP and target propagation (TP), by precisely characterizing the role of higher-order derivatives. These bounds, derived through detailed analysis of the Hessian structures, show that PC is significantly closer to quasi-Newton updates than TP, providing a deeper understanding of the stability and efficiency of PC compared to conventional learning methods.
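
The energy-minimizing dynamics mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a hypothetical scalar three-node chain (clamped input `x0`, free hidden state `x1`, clamped target `y`) with a tanh activation and fixed weights `w1`, `w2`, and it treats the sum of squared prediction errors as the free energy. It runs gradient descent on the hidden state and records the energy, which should not increase for a sufficiently small step size.

```python
import math


def free_energy(x0, x1, y, w1, w2):
    """Sum of squared prediction errors for a toy scalar chain (assumed form)."""
    e1 = x1 - w1 * math.tanh(x0)  # hidden-layer prediction error
    e2 = y - w2 * math.tanh(x1)   # output-layer prediction error
    return 0.5 * e1 ** 2 + 0.5 * e2 ** 2


def infer(x0, y, w1, w2, step=0.1, n_steps=50):
    """Relax the hidden state x1 by gradient descent on the free energy."""
    x1 = w1 * math.tanh(x0)  # feedforward initialization of the hidden state
    energies = [free_energy(x0, x1, y, w1, w2)]
    for _ in range(n_steps):
        e1 = x1 - w1 * math.tanh(x0)
        e2 = y - w2 * math.tanh(x1)
        # dF/dx1 = e1 - e2 * w2 * (1 - tanh(x1)^2)
        grad = e1 - e2 * w2 * (1.0 - math.tanh(x1) ** 2)
        x1 -= step * grad
        energies.append(free_energy(x0, x1, y, w1, w2))
    return x1, energies


# Hypothetical values chosen only for illustration.
x1_star, energies = infer(x0=1.0, y=-1.0, w1=0.7, w2=0.8)
```

In this toy setting the recorded energy plays the role of a Lyapunov function for the inference dynamics: it is bounded below by zero and does not increase along the trajectory when the step size is small relative to the curvature of the energy.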

Paper Structure

This paper contains 24 sections, 19 theorems, 128 equations, 3 figures, and 2 tables.

Key Result

Theorem 3.1

Let $M$ be a PCN that minimizes a free energy $F = L + \tilde{E}$, where $L$ is the backprop loss and $\tilde{E}$ is the residual energy. Assume the activation function $f$ and its derivatives $f'$, $f''$, and $f'''$ are Lipschitz continuous with constants $K$, $K'$, $K''$, and $K'''$, respectively. The

Figures (3)

  • Figure 1: Convergence comparison of backprop (BP; blue) and predictive coding (PC; red) in a convolutional network with MSE loss trained on MNIST for $30$K steps. Panels A and B show the training (light colors) and test (dark colors) loss (A) and accuracy (B) for a 5-layer network (TeLU activation, optimizer $=$ SGD with momentum $0.9$, lr $= 0.01$, batch_size $= 100$) trained using PC (red) and BP (blue). Panels C and D depict the relative error (C) and angle (D) between the parameter updates ($d\theta$) and the negative gradient of the loss at each layer. While PC and BP achieve comparable accuracies in all experiments, the differences in the parameter updates highlight the nuances between the two approaches. (Note: We adapted the code of rosenbaum2022relationship to generate these plots.)
  • Figure 2: Convergence analysis with the tanh activation; same setting as in Figure 1.
  • Figure 3: Convergence comparison of backprop (BP; blue) and predictive coding (PC; red) in a convolutional network with MSE loss trained on the CIFAR-10 dataset over the first $4$K steps. Panels A and B show the training (light colors) and test (dark colors) loss (A) and accuracy (B) for a 5-layer network (TeLU activation, optimizer $=$ SGD with momentum $0.9$, lr $= 0.01$, batch_size $= 100$) trained using PC (red) and BP (blue). Panels C and D depict the relative error (C) and angle (D) between the parameter updates ($d\theta$) and the negative gradient of the loss at each layer.
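
The relative error and angle reported in panels C and D can be computed per layer from a flattened parameter update and the flattened negative loss gradient. The sketch below shows one common way to define the two quantities; the exact definitions used for the figures are not given here, so the formulas and the function names `relative_error` and `angle_degrees` are assumptions for illustration.

```python
import math


def _norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def relative_error(update, neg_grad):
    """||update - neg_grad|| / ||neg_grad|| (assumed definition)."""
    diff = [u - g for u, g in zip(update, neg_grad)]
    return _norm(diff) / _norm(neg_grad)


def angle_degrees(update, neg_grad):
    """Angle between the two vectors, in degrees."""
    dot = sum(u * g for u, g in zip(update, neg_grad))
    cos = dot / (_norm(update) * _norm(neg_grad))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos))


# Hypothetical flattened layer update and negative gradient.
d_theta = [0.0, 1.0]
neg_grad = [1.0, 0.0]
err = relative_error(d_theta, neg_grad)
ang = angle_degrees(d_theta, neg_grad)
```

An update that matches the negative gradient exactly gives a relative error of 0 and an angle of 0 degrees. An update of equal length in an orthogonal direction, as in the example values, gives an angle of 90 degrees.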

Theorems & Definitions (47)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Definition 2.5
  • Definition 2.6
  • Definition 2.7
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • ...and 37 more