Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation

Satoki Ishikawa, Rio Yokota, Ryo Karakida

TL;DR

This work addresses stable local learning via layer-wise targets and losses, focusing on Predictive Coding (PC) and Target Propagation (TP). It develops the maximal update parameterization ($\mu$P) in the infinite-width limit for PC and TP, enabling hyperparameter transfer across widths ($\mu$Transfer) and revealing that PC's gradients can interpolate between first-order GD and Gauss-Newton-like forms, while TP biases toward feature learning. The analysis shows that in deep linear networks PC's gradient behavior depends on parameterization and inference size, and that TP lacks a kernel regime due to the independent treatment of the feedback channel; these insights elucidate when local learning can emulate or diverge from BP. Collectively, the results provide a theoretical foundation for scalable local learning, offering practical guidance for cross-width hyperparameter transfer and highlighting fundamental differences between PC and TP in large networks.

Abstract

Local learning, which trains a network through layer-wise local targets and losses, has been studied as an alternative to backpropagation (BP) in neural computation. However, its algorithms often become more complex or require additional hyperparameters because of the locality, making it challenging to identify desirable settings in which the algorithm progresses in a stable manner. To provide theoretical and quantitative insights, we introduce the maximal update parameterization ($\mu$P) in the infinite-width limit for two representative designs of local targets: predictive coding (PC) and target propagation (TP). We verified that $\mu$P enables hyperparameter transfer across models of different widths. Furthermore, our analysis revealed unique and intriguing properties of $\mu$P that are not present in conventional BP. By analyzing deep linear networks, we found that PC's gradients interpolate between first-order and Gauss-Newton-like gradients, depending on the parameterization. We demonstrate that, in specific standard settings, PC in the infinite-width limit behaves more similarly to the first-order gradient. For TP, even with the standard scaling of the last layer, which differs from classical $\mu$P, its local loss optimization favors the feature learning regime over the kernel regime.

Paper Structure

This paper contains 56 sections, 9 theorems, 131 equations, 24 figures, 2 tables, 1 algorithm.

Key Result

Proposition 3.3

Consider the first one-step update by the GNT: $W_{l, 1} = W_{l,0} - \eta_l \phi '(W_{l, t} {h}_{l-1}) \circ (\delta_l \delta_l^{\top} + \rho I)^{-e_B} \delta_l \text{diag}({e}_L) h_{l-1}^{\top}$, where $\delta_l = \nabla_{u_l} u_L$ and $e_L=y-h_L$. In the infinite-width limit, this update admits the [...] where $\theta_l := 2 a_l + c_l$. We obtain the $\mu$P of SGD for $e_B=0$, and that of GNT for $e_B=1$.
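To make the role of the exponent $e_B$ concrete, the following is a minimal NumPy sketch of an update of this form. It is an illustration under stated assumptions, not the authors' implementation: the layer is taken to be linear (so $\phi' = 1$), the output is scalar with a batch of size $B$, and the array shapes and the helper names `damped_power` and `gnt_step` are hypothetical choices made here.

```python
import numpy as np

def damped_power(delta, rho, e_B):
    """Return (delta delta^T + rho I)^(-e_B) via an eigendecomposition.

    Assumed shape: delta is (n_l, B). For rho > 0 the matrix is symmetric
    positive definite, so a real (fractional) power is well defined.
    """
    n = delta.shape[0]
    gram = delta @ delta.T + rho * np.eye(n)
    evals, evecs = np.linalg.eigh(gram)
    # Scale each eigenvector column by its eigenvalue raised to -e_B.
    return (evecs * evals ** (-e_B)) @ evecs.T

def gnt_step(W, delta, e_L, h_prev, lr, rho, e_B):
    """One update of the displayed form, assuming phi' = 1 (linear layer).

    Assumed shapes: W (n_l, n_prev), delta (n_l, B), e_L (B,), h_prev (n_prev, B).
    """
    direction = damped_power(delta, rho, e_B) @ delta @ np.diag(e_L) @ h_prev.T
    return W - lr * direction

# Toy data (random, for illustration only).
rng = np.random.default_rng(0)
n_l, n_prev, B = 4, 3, 5
W0 = rng.normal(size=(n_l, n_prev))
delta = rng.normal(size=(n_l, B))
e_L = rng.normal(size=B)
h_prev = rng.normal(size=(n_prev, B))

W_sgd = gnt_step(W0, delta, e_L, h_prev, lr=0.1, rho=1e-2, e_B=0)  # no preconditioning
W_gnt = gnt_step(W0, delta, e_L, h_prev, lr=0.1, rho=1e-2, e_B=1)  # damped inverse
```

With `e_B=0` the preconditioner is the identity and the step is the plain first-order one; with `e_B=1` the direction is multiplied by the damped inverse $(\delta_l \delta_l^{\top} + \rho I)^{-1}$. Intermediate values of `e_B` interpolate between the two.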

Figures (24)

  • Figure 1: $\mu$P enables the transfer of learning rates across widths. (Left) PC reduces to SGD when F-ini, FPA, and SI are applied. In fact, using the $\mu$P of SGD, learning rates are successfully transferred across different widths. (Right) Even without FPA, our $\mu$P of PC also allows $\mu$Transfer across widths. In this case, inference is performed only once, and the difference in test accuracy between $\bar{\gamma}_L = 0$ and $\bar{\gamma}_L = 1$ is small. Both figures show results with a 3-layer MLP on FashionMNIST.
  • Figure 2: (Left) Comparison of gradients with the analytical solution of a linear network. We measured the cosine similarity between the gradients analytically derived in Theorem \ref{thm:linear-pc} and the BP gradients or GN gradients for each layer. (a) As $M_L$ approaches 1, PC's gradient converges to BP's. (b) As $M_l$ increases, the PC gradient approaches BP's. (c) $\bar{\gamma}_L = 0$ yields gradients closer to the BP gradient (which means SGD in this experiment) compared to $\bar{\gamma}_L = -1$. (Right) In a nonlinear MLP, PC's gradient also approaches BP's when $\bar{\gamma}_L = 0$.
  • Figure 3: (Left) $\bar{\gamma}_L = -1$ steadily reduces the local loss as width increases. We observed the inference loss in a randomly initialized linear network for various $\bar{\gamma}_L$. For $\bar{\gamma}_L = -1$, the inference loss consistently decreases with increasing width. (Right) The "wider is better" trend holds for $\mu$P with $\bar{\gamma}_L = -1$. With F-ini, this trend holds for $\mu$P regardless of the $\bar{\gamma}_L$ value. However, without F-ini, the benefits of $\bar{\gamma}_L = -1$ become particularly prominent.
  • Figure 4: (Left) $\mu$P can transfer the learning rate across widths (without F-ini). We trained a 3-layer CNN on FashionMNIST with 100 inference iterations. Without F-ini, the stability of the inference becomes more crucial. As a result, unlike the single-shot SI with F-ini shown in Figure \ref{fig:fpa-pc-mup}, the stability provided by $\bar{\gamma}_L = -1$ becomes critical. Note that additional experiments under different settings, including those with VGG5 (Figure \ref{fig:vgg-transfer}) and cross-entropy loss (Figure \ref{fig:ce-pc-transfer}), are presented in Section \ref{sec:app-exp-mup-pc} of the Appendix. (Right) $\Delta h$ remains consistent across widths during training. We confirm that the condition $\Delta h = \Theta(1)$ required by $\mu$P holds throughout the training.
  • Figure 5: $\mu$P with $\bar{\gamma}_L = -1$ performs consistently well, regardless of $\gamma_l$. When $\gamma_l$ is small ($\gamma_l=0.01$), $\mu$P with $\bar{\gamma}_L = 0$ performs poorly, while $\mu$P with $\bar{\gamma}_L = -1$ shows significantly better performance. This difference is likely due to slower inference convergence in $\mu$P with $\bar{\gamma}_L = 0$. For larger values of $\gamma_l$ ($\gamma_l=1$), both $\mu$P configurations exhibit high accuracy. However, for $\mu$P with $\bar{\gamma}_L = 0$, $\gamma_L$ does not transfer effectively across widths, whereas $\mu$P with $\bar{\gamma}_L = -1$ demonstrates the successful transfer of $\gamma_L$ across widths.
  • ...and 19 more figures
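
The Figure 2 caption describes a per-layer cosine similarity between gradients from two methods. A minimal sketch of that measurement is given below; the function name `layerwise_cosine` and the list-of-arrays layout are assumptions made for illustration, and the toy gradients are not results from the paper.

```python
import numpy as np

def layerwise_cosine(grads_a, grads_b):
    """Cosine similarity between two gradients, computed layer by layer.

    grads_a and grads_b are lists of arrays, one entry per layer, with
    matching shapes. Each pair is flattened before taking the inner product.
    Assumes no gradient is exactly zero (the norm would then be 0).
    """
    sims = []
    for ga, gb in zip(grads_a, grads_b):
        a, b = np.ravel(ga), np.ravel(gb)
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sims

# Toy gradients for a two-layer network (illustrative values only).
g_ref = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[1.0, 2.0]])]
g_same = [2.0 * g for g in g_ref]           # same direction, different scale
g_other = [np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([[-1.0, -2.0]])]

sims_same = layerwise_cosine(g_ref, g_same)
sims_other = layerwise_cosine(g_ref, g_other)
```

Because the measure is scale-invariant, it compares update directions only; two methods whose gradients differ just by a per-layer learning-rate factor score 1 on every layer.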

Theorems & Definitions (11)

  • Proposition 3.3: ishikawa2024on
  • Theorem 4.1: $\mu$P for PC (informal)
  • Theorem 4.2
  • Corollary 4.3
  • Theorem 5.1: $\mu$P for TP and DTP (informal)
  • Definition A.2: Stability of learning
  • Definition A.5: $\mu$P
  • Theorem B.1: Stable parameterization for PC
  • Lemma B.2
  • Theorem B.3
  • ...and 1 more