Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

Blake Bordelon, Cengiz Pehlevan

TL;DR

This work develops a self-consistent dynamical mean-field theory (DMFT) for feature learning in infinite-width neural networks under gradient flow, capturing kernel evolution via a set of layer-wise kernels $\{\Phi^\ell, G^\ell\}$ and a tunable feature-learning strength $\gamma_0$. Using a path-integral MSRDJ formulation, the authors derive saddle-point DMFT equations that couple stochastic activation/gradient dynamics to kernel dynamics, recovering the Tensor Programs description at $\gamma_0=1$ and enabling a polynomial-time alternating Monte Carlo solver for nonlinear nets. In the deep linear case, the DMFT closes to algebraic matrix equations, while in nonlinear settings it yields practical self-consistent kernel evolution that matches finite-width behavior and reveals when common approximations fail. Experiments on CIFAR with CNNs show that, at fixed feature-learning strength, loss and kernel dynamics persist across widths, supporting the theory’s relevance for real-world deep learning regimes.

Abstract

We analyze feature learning in infinite-width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel, and consequently output predictions. We show that the field theory derivation recovers the recursive stochastic process of infinite-width feature learning networks obtained by Yang and Hu (2021) with Tensor Programs. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to self-consistently solve for the kernel order parameters. We compare the self-consistent solution to various approximation schemes, including the static NTK approximation, the gradient independence assumption, and leading-order perturbation theory, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description. Lastly, we provide experiments in more realistic settings which demonstrate that the loss and kernel dynamics of CNNs at fixed feature learning strength are preserved across different widths on a CIFAR classification task.
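To make the notion of an inner-product kernel order parameter concrete, the following is a minimal sketch of how a feature kernel could be estimated from a finite-width layer, assuming the convention $\Phi_{\mu\alpha}(t,s) = \frac{1}{N}\,\phi(h_\mu(t)) \cdot \phi(h_\alpha(s))$ with preactivations stored as a $P \times N$ array. The function name `feature_kernel`, the array layout, and the random Gaussian preactivations are illustrative assumptions, not code from the paper.

```python
import numpy as np

def feature_kernel(h_t, h_s, phi=np.tanh):
    """Estimate Phi[mu, alpha] = (1/N) * phi(h_t[mu]) . phi(h_s[alpha]).

    h_t, h_s: arrays of shape (P, N) holding preactivations for P samples
    and N hidden units at two (possibly equal) training times.
    """
    n_units = h_t.shape[1]
    return phi(h_t) @ phi(h_s).T / n_units

# Hypothetical stand-in for preactivations at initialization: i.i.d. Gaussians.
rng = np.random.default_rng(0)
P, N = 4, 2000
h0 = rng.standard_normal((P, N))

Phi = feature_kernel(h0, h0)  # equal-time kernel, shape (P, P)
```

The gradient kernel can be estimated the same way by replacing $\phi(h)$ with the per-unit backpropagated signals. An equal-time kernel built this way is symmetric and positive semidefinite by construction.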

Paper Structure

This paper contains 63 sections, 176 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Neural network feature learning dynamics is captured by self-consistent dynamical mean field theory (DMFT). (a) Training loss curves on a subsample of $P=10$ CIFAR-10 training points in a depth 4 ($L=3$, $N=2500$) tanh network ($\phi(h) = \tanh(h)$) trained with MSE. Increasing $\gamma_0$ accelerates training. (b)-(c) The distribution of preactivations at the beginning and end of training matches predictions of the DMFT. (d) The final $\Phi^\ell$ (at $t=100$) kernel order parameters match the finite width network. (e) The temporal dynamics of the sample-traced kernels $\sum_{\mu} \Phi_{\mu\mu}^{\ell}(t,s)$ matches experiment and reveals rich dynamics across layers. (f) The alignment $A(\bm\Phi^\ell_{DMFT}, \bm\Phi^\ell_{NN})$, defined as cosine similarity, of the kernel $\Phi^\ell_{\mu\alpha}(t,s)$ predicted by theory (DMFT) and width $N$ networks for different $N$ but fixed $\gamma_0 = \gamma/\sqrt{N}$. Error bars show the standard deviation computed over $10$ repeats. Around $N \sim 500$ the DMFT begins to show near-perfect agreement with the NN. (g)-(i) The same plots but for the gradient kernel $\bm G^\ell$. Whereas finite width effects for $\bm\Phi^\ell$ are larger at later layers $\ell$ since variance accumulates on the forward pass, fluctuations in $\bm G^\ell$ are large in early layers.
  • Figure 2: Deep linear network with the full DMFT. (a) The train loss for NNs of varying $L$. (b) For an $L=5$, $N=1000$ NN, the kernels $H^\ell$ at the end of training compared to the DMFT on $P=20$ datapoints. (c) The average displacement of feature kernels for different depth networks at the same $\gamma_0$ value. For equal values of $\gamma_0$, deeper networks exhibit larger changes to their features, manifested in lower alignment with their initial $t=0$ kernels $\bm H$. (d) The solution to the temporal components of the $G^\ell(t,s)$ and $\sum_{\mu}H^\ell_{\mu\mu}(t,s)$ kernels obtained from the self-consistent equations.
  • Figure 3: Width $N=1000$ ReLU networks trained with L2 regularization have a nontrivial fixed point in the DMFT limit ($\gamma_0 > 0$). (a) Training loss dynamics for an $L=1$ ReLU network with $\lambda = 1$. In the $\gamma_0 \to 0$ limit the fixed point is trivial, $f = K = 0$. The final loss is a decreasing function of $\gamma_0$. (b) The final kernel is more aligned with the target with increasing $\gamma_0$. Networks with homogeneous activations enjoy a representer theorem at infinite width, as we show in Appendix \ref{app:weight_decay}.
  • Figure 4: Comparison of DMFT to various approximation schemes in an $L=5$ hidden layer, width $N=1000$ linear network with $\gamma_0 = 1.0$ and $P=100$. (a) The losses under the various approximations do not track the true trajectory induced by gradient descent in the large $\gamma_0$ regime. (b)-(c) The feature kernels $H^\ell_{\mu\alpha}(t,s)$ across each of the $L=5$ hidden layers for each of the theories are compared to a width $1000$ neural network. Again, we plot the sample-traced dynamics $\sum_{\mu} H^\ell_{\mu\mu}(t,s)$. (d) The alignment of $\bm H^\ell$ compared to the finite NN $A(\bm H^\ell, \bm H^\ell_{NN})$ averaged across $\ell \in \{1,...,5\}$ for varying $\gamma_0$. The predictions of all of these theories coincide in the $\gamma_0 = 0$ limit but begin to deviate in the feature learning regime. Only the non-perturbative DMFT is accurate over a wide range of $\gamma_0$.
  • Figure 5: The dynamics of depth $5$ ($L=4$ hidden) CNNs trained on the first two classes of CIFAR (boat vs plane) are consistent across channel counts $N \in \{250,500\}$ for fixed $\gamma_0 = \gamma / \sqrt{N}$. (a) We plot the test loss (MSE) and (b) test classification error. Networks with higher $\gamma_0$ train more rapidly. Time is measured in units of $100$ update steps. (c) The dynamics of the last layer feature kernel $\Phi^L$, shown as alignment to the target function. As predicted by the DMFT, higher $\gamma_0$ corresponds to more active kernel evolution, evidenced by a larger change in the alignment.
  • ...and 5 more figures
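
The captions for Figures 1 and 4 report an alignment $A(\cdot,\cdot)$ between theory and finite-width kernels, defined as cosine similarity. A minimal sketch of that metric follows, assuming the kernels are flattened over all of their sample and time indices before taking the normalized inner product. The function name `kernel_alignment` and the synthetic kernels are illustrative assumptions, not the authors' code.

```python
import numpy as np

def kernel_alignment(a, b):
    """Cosine similarity between two kernels of identical shape.

    Both arrays are flattened, so the same function applies to a (P, P)
    equal-time kernel or a (P, P, T, T) two-time kernel.
    """
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic example: a random PSD kernel, a rescaled copy, and a noisy copy.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 50))
K = X @ X.T / 50
K_noisy = K + 0.1 * rng.standard_normal(K.shape)

a_same = kernel_alignment(K, 3.0 * K)    # rescaling leaves the alignment at 1
a_noisy = kernel_alignment(K, K_noisy)   # perturbation lowers it below 1
```

Because the metric is invariant to overall scale, it measures agreement in the structure of the kernels rather than in their magnitude.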