Wilsonian Renormalization of Neural Network Gaussian Processes

Jessica N. Howard, Ro Jefferson, Anindita Maiti, Zohar Ringel

TL;DR

This work introduces a Wilsonian Renormalization Group framework for Gaussian Process regression to study learnable versus unlearnable neural network features. By integrating out high-frequency kernel modes, it derives an RG flow in which the ridge parameter $\sigma^2$ renormalizes and, in non-Gaussian settings, becomes input-dependent, linking scale separation to generalization behavior. The approach yields tractable equations in the Gaussian case and connects to neural scaling laws through the kernel eigen-spectrum, with empirical validation on MNIST and CIFAR10 that matches observed MSE scaling trends. Extensions to non-Gaussian feature distributions reveal spatial reweighting of the loss and a functional RG flow for input-dependent regularization, providing a path toward universality classifications in deep learning and potential insights into feature learning in large neural networks.

Abstract

Separating relevant and irrelevant information is key to any modeling process or scientific inquiry. Theoretical physics offers a powerful tool for achieving this in the form of the renormalization group (RG). Here we demonstrate a practical approach to performing Wilsonian RG in the context of Gaussian Process (GP) Regression. We systematically integrate out the unlearnable modes of the GP kernel, thereby obtaining an RG flow of the GP in which the data sets the IR scale. In simple cases, this results in a universal flow of the ridge parameter, which becomes input-dependent in the richer scenario in which non-Gaussianities are included. In addition to being analytically tractable, this approach goes beyond structural analogies between RG and neural networks by providing a natural connection between RG flow and learnable vs. unlearnable modes. Studying such flows may improve our understanding of feature learning in deep neural networks, and enable us to identify potential universality classes in these models.

Paper Structure

This paper contains 15 sections, 92 equations, 6 figures.

Figures (6)

  • Figure 1: Predicting empirical MSE loss scaling on MNIST and CIFAR10 regression tasks. For two real-world datasets (MNIST and CIFAR10), we show different theoretical predictions for the MSE loss as a function of the number of datapoints, $\eta$. For each dataset, we consider two different regression tasks. For MNIST we consider both '8-9' and '0-1', where $N=9,837$ and $N=10,564$, respectively. For CIFAR10 we consider both 'ship-truck' and 'automobile-bird', where $N=10,000$ in both cases. For all experiments, we fix $\sigma^2=10^{-8}$. We find that both the state-of-the-art Spectral Bias theory \cite{canatar} and the RG theory introduced in this work predict the semi power-law behavior well, whereas the EK approximation, in which there is no noise renormalization (i.e. $\sigma_{\rm eff}^2 = \sigma^{2}$), fails to predict this behavior accurately.
  • Figure 2: Illustration that the feature modes of real-world datasets are often Gaussian distributed. For two real-world datasets (MNIST and CIFAR10), we select two feature modes and show their joint distribution along with the best Gaussian fit to each marginal. This is done for each learning task considered, namely '8-9' and '0-1' in the MNIST case and 'ship-truck' and 'automobile-bird' in the CIFAR10 case. As previously noted in \cite{Simon2021}, the feature modes of real-world datasets are often (jointly) Gaussian distributed. Appendix \ref{app:gaussian_vs_cauchy} explores the extent to which this varies as a function of $k$.
  • Figure 3: Non-Gaussian features and spatial re-weighting effects. Theory versus experiment for the model of sec. \ref{sec:toy} with $n=100$ datapoints, $\sigma^2=400$, and $\lambda_>:=\lambda_2=0.1$ (unless stated otherwise). Learning a 5th Hermite polynomial with a kernel capable of expressing only 1st and 2nd Hermite polynomials should give a zero average predictor (green line) according to the standard theory \cite{canatar,Cohen_2021}. However, due to spatial re-weighting, a coupling between the 1st and 5th Hermite polynomials arises, leading to a non-zero result. For both $\lambda_>=0$ and $\lambda_>=0.1$, $m = 5$ million trials are performed. The average and standard error (i.e. standard deviation/$\sqrt{m}$) are reported.
  • Figure 4: Effect of varying the learnability threshold, $T$. For the two real-world datasets (MNIST and CIFAR10) considered in section \ref{sec:scalingLaws_empirical}, we show the effect of the choice of learnability threshold $T\in (0,1)$. The choice of $T$ implicitly determines the learnability cutoff, $\kappa$, which is found by requiring that the learnability factor satisfies $L_\kappa \approx T$ (see equation \ref{eq:learnabilityfactor}). While this choice slightly changes the RG theory prediction at high $\eta$, the effect is relatively minor.
  • Figure 5: Example distribution of high-$k$ feature modes. By comparing with figure \ref{fig:empirical_results_scalinglaws_gaussianfeaturesproof}, we can see that the feature modes shift from being best described by a Gaussian distribution to being better described by a Cauchy distribution for high $k$ values. The best fit Gaussian and best fit Cauchy distributions are overlaid.
  • ...and 1 more figures
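
The cutoff selection described in the Figure 4 caption (pick $\kappa$ such that $L_\kappa \approx T$) can be sketched as follows. This is only an illustrative sketch: the referenced equation for the learnability factor is not reproduced here, so the code assumes the per-mode form $L_k = \lambda_k / (\lambda_k + \sigma^2/\eta)$ familiar from the equivalent kernel (EK) picture. The function names, the toy power-law spectrum, and the parameter values are all hypothetical.

```python
import numpy as np


def learnability(eigenvalues, sigma2, eta):
    """Assumed per-mode learnability factor L_k = lambda_k / (lambda_k + sigma^2 / eta).

    This functional form is an assumption made for illustration; it is not
    taken from the text above.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    return lam / (lam + sigma2 / eta)


def learnability_cutoff(eigenvalues, sigma2, eta, threshold):
    """Return the 0-based index kappa of the mode whose learnability is closest to the threshold T."""
    # Sort the kernel eigenvalues in descending order so that the mode index
    # increases from most learnable to least learnable.
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    factors = learnability(lam, sigma2, eta)
    return int(np.argmin(np.abs(factors - threshold)))


# Hypothetical toy spectrum: lambda_k = k^(-2) for k = 1, ..., 1000.
k = np.arange(1, 1001)
toy_spectrum = k ** -2.0

# With sigma^2 / eta = 1e-4, the assumed factor crosses T = 0.5 where lambda_k = 1e-4.
kappa = learnability_cutoff(toy_spectrum, sigma2=1e-2, eta=100, threshold=0.5)
print("cutoff index:", kappa, "eigenvalue:", toy_spectrum[kappa])
```

Under this assumed form, modes with index below `kappa` are treated as learnable and the remaining modes as unlearnable; varying `threshold` within $(0,1)$ moves the cutoff along the spectrum.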