Applications of Statistical Field Theory in Deep Learning

Zohar Ringel, Noa Rubin, Edo Mor, Moritz Helias, Inbar Seroussi

TL;DR

This work surveys the application of statistical field theory to deep learning, arguing that a physics-inspired framework—built on path integrals, replicas, and large-width limits—can illuminate generalization, bias, and feature learning. It develops three analytic strands: (i) infinite-width Gaussian-process/NTK mappings that connect DNNs to GPR and kernel methods, (ii) field-theoretic treatments of data-averaged GPR via replicas and RG to capture dataset effects and scaling laws, and (iii) dynamical field theories (MSRDJ) for non-linear, finite-width networks to bridge equilibrium GP/NTK descriptions with time-dependent learning. These approaches yield concrete insights such as spectral bias, effective ridge renormalization, and kernel adaptation mechanisms that can explain when and how deep networks outperform their lazy or linear counterparts. The synthesis points to practical implications like hyperparameter transfer, scaling laws, and principled regularization via kernel dynamics, while outlining avenues for extending field-theoretic analyses to richer architectures and dynamics. Overall, the paper sketches a path toward a unifying theory of deep learning grounded in statistical physics, with tangible predictions for generalization and learning dynamics.

Abstract

Deep learning algorithms have made incredible strides in the past decade, yet due to their complexity, the science of deep learning remains in its early stages. Because deep learning is an experimentally driven field, it is natural to seek a theory of it within the physics paradigm. As deep learning is largely about learning functions and distributions over functions, statistical field theory, a rich and versatile toolbox for tackling complex distributions over functions (fields), is an obvious choice of formalism. Research efforts carried out in the past few years have demonstrated the ability of field theory to provide useful insights on generalization, implicit bias, and feature learning effects. Here we provide a pedagogical review of this emerging line of research.

Paper Structure

This paper contains 46 sections, 191 equations, 4 figures.

Figures (4)

  • Figure 1: Gaussian process regression on four 10k binary CIFAR and MNIST datasets, at $\kappa^2=10^{-8}$. Experimental results (dots) agree well with both the effective ridge theory and the RG theory. For the latter, we took $0.01$-learnability as marking the RG cut-off. We note that results are similarly accurate for $T=0.001$ and $T=0.1$. The Equivalent Kernel estimator is expected to become accurate when the loss reaches the scale of $\kappa^2$, which explains its poor performance in the shown range of $P$.
  • Figure 2: Here we consider a single-hidden-layer linear network trained on a linear single-index target, and compare the theoretical predictions of the kernel scaling approximation for the network output, as well as those of the NNGP. We study two measures for the network output: (a) learnability, which we define as $\frac{f\cdot y}{y \cdot y}$ and which corresponds to the proportion of the target learned by the network, and (b) mean squared test error. Network parameters: $d=50$, $N=1000$. Each experimental point corresponds to an ensemble of $\sim$30 networks trained on different data seeds. Each network was trained until there was no visible change in the learnability, loss, or hidden-layer weight variance.
  • Figure 3: Learnability of linear CNNs as a function of $P$. We take $S,N,C \propto \alpha$ and consider different scales $\alpha$ of these parameters. The network is observed to learn the target at $P\propto d^{3/4}$ regardless of the parameter scale, whereas the GP prediction is learning at $P\propto d$. Parameters: $\chi=100$, $N=10\alpha$, $S=50\alpha$, $C=1000\alpha$.
  • Figure 4: We compare a linear network trained on a single-index linear teacher with an Erf network trained on a cubic single-index teacher ($y(x)=w_* \cdot x +0.1 H_3(w_*\cdot x)$, where $H_3$ is the third Hermite polynomial). Panels (a) and (b) show, for the Erf and linear networks respectively, the ratio of the kernel eigenvalue along the teacher direction to the eigenvalues along orthogonal directions. Panels (c) and (d) show the learnability ($f\cdot y/y\cdot y$) for the Erf and linear networks, respectively. Network parameters: $\chi=100$, $N_w=1,5,10$, $S=50$, $C=1000$.
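
The learnability measure quoted in the captions, $f\cdot y / y\cdot y$, can be illustrated with a small Gaussian process regression experiment. The following is a minimal sketch, not the paper's setup: the linear kernel, the synthetic Gaussian inputs, the dimensions, the ridge value standing in for $\kappa^2$, and the helper names `linear_kernel`, `gpr_mean`, and `learnability` are all assumptions made for illustration. The dot products are taken over a finite test sample.

```python
import numpy as np


def linear_kernel(a, b):
    """Linear kernel K(x, x') = x . x' / d (an illustrative choice, not the paper's)."""
    return a @ b.T / a.shape[1]


def gpr_mean(x_train, y_train, x_test, ridge):
    """GPR posterior mean on x_test; `ridge` plays the role of kappa^2."""
    k_train = linear_kernel(x_train, x_train)
    k_test = linear_kernel(x_test, x_train)
    coeffs = np.linalg.solve(k_train + ridge * np.eye(len(x_train)), y_train)
    return k_test @ coeffs


def learnability(f, y):
    """Proportion of the target captured by the predictor: (f . y) / (y . y)."""
    return float(f @ y) / float(y @ y)


# Synthetic linear single-index target y(x) = w_star . x (assumed toy data).
rng = np.random.default_rng(0)
d, n_test = 5, 500
w_star = rng.standard_normal(d)
x_test = rng.standard_normal((n_test, d))
y_test = x_test @ w_star

# Learnability as a function of the number of training points P.
results = {}
for p in (2, 5, 50, 200):
    x_train = rng.standard_normal((p, d))
    y_train = x_train @ w_star
    f_test = gpr_mean(x_train, y_train, x_test, ridge=1e-4)
    results[p] = learnability(f_test, y_test)
    print(f"P={p:4d}  learnability={results[p]:.3f}")
```

In this toy setting the predictor is equivalent to ridge regression, so learnability should approach 1 once $P$ is well above $d$. A learnability of 1 means the predictor's overlap with the target equals the target's squared norm, and a value of 0 means there is no overlap.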