Renormalization group for deep neural networks: Universality of learning and scaling laws

Gorka Peraza Coppola, Moritz Helias, Zohar Ringel

TL;DR

This work introduces a Wilsonian renormalization group framework for deep neural networks trained on power-law data, linking learning curves to self-similar structure in the data and kernel spectra. Starting from the infinite-width Gaussian Process limit, it adds weak non-Gaussian interactions to capture finite-width effects and develops a continuum, scale-aware RG in kernel (momentum) space that accounts for spectral discreteness and lack of translation invariance. A key result is the replacement of traditional scaling dimensions with scaling intervals, along with a Gaussian-process–like UV fixed point that governs large-data universality and leads to explicit neural-scaling laws with calculable corrections. The framework yields concrete predictions for mean predictors, variances, hyperparameter transfer across model sizes, and the dependence of learning curves on dataset size, with non-Gaussian corrections decaying as $P$ grows, thereby connecting critical phenomena methods to practical learning dynamics and resource estimates.

Abstract

Self-similarity, where observables at different length scales exhibit similar behavior, is ubiquitous in natural systems. Such systems are typically characterized by power-law correlations and universality, and are studied using the powerful framework of the renormalization group (RG). Intriguingly, power laws and weak forms of universality also pervade real-world datasets and deep learning models, motivating the application of RG ideas to the analysis of deep learning. In this work, we develop an RG framework to analyze self-similarity and its breakdown in learning curves for a class of weakly non-linear (non-lazy) neural networks trained on power-law distributed data. Features often neglected in standard treatments -- such as spectrum discreteness and lack of translation invariance -- lead to both quantitative and qualitative departures from conventional perturbative RG. In particular, we find that the concept of scaling intervals naturally replaces that of scaling dimensions. Despite these differences, the framework retains key RG features: it enables the classification of perturbations as relevant or irrelevant, and reveals a form of universality at large data limits, governed by a Gaussian Process-like UV fixed point.

Paper Structure

This paper contains 49 sections, 294 equations, 13 figures.

Figures (13)

  • Figure 1: Initial conditions for the flow equation and the meaning of the cutoff. One may choose different cutoffs, denoted as $\Lambda_{0}$ and $\Lambda_{1}$ with corresponding initial conditions $(s_{0},r_{0},U_{0})$ and $(s_{1},r_{1},U_{1})$, respectively. The two corresponding systems behave the same in the low momentum range for any $k=\Lambda_{0}/\ell$, if the two systems' parameters lie on the same RG trajectory. This also allows the definition of the $\Lambda\to\infty$ limit.
  • Figure 2: Renormalization group flow field given by eq:RG_U_main and eq:RG_r_main at different decimation scales $\tau:=\ln\,\ell$. The plot represents the normalized rate of change of the vector field $(U,r)\mapsto(U+\mathsf{d}U,r+\mathsf{d}r)/N(U,r)$, where $\mathsf{d}U$ and $\mathsf{d}r$ follow the flow equations eq:RG_U_main and eq:RG_r_main, respectively, and $N(U,r)$ denotes the vector norm at each point. Parameters are set to $\alpha=0.2$ with the special choice $\beta=1+\alpha$, which ensures that $s(\ell)\equiv1$. The color scale indicates the logarithmic magnitude of the flow vectors, $\ln\,N(U,r)$. Black dashed curves show the $r$-nullcline given by eq:nullcline_limit_r.
  • Figure 3: Predictor in non-Gaussian regression. (a) Upper panel: Numerical result for the mean discrepancy $\langle\Delta\rangle$ (red dots) obtained by Langevin sampling until equilibrium. Gaussian process prediction that neglects the non-Gaussian terms (orange). Perturbative result to linear order in $U$ (black dashed). Prediction of Gaussian discrepancy $\sqrt{P}\,z(\ell)\,d(\Lambda,\ell)|_{\ell=\Lambda/k}$ from the renormalized theory (blue). Lower panel: Difference between numerical result and prediction from RG. (b) Corresponding numerical and theoretical results for the variance $\langle\Delta^{2}\rangle^{c}$, with the same color code as in panel (a). Other parameters: $\alpha=0.2$, $\beta=0.3$, $U=0.05$. Numerical results obtained by sampling the learning dynamics for $T=50\cdot10^{6}$ steps with time resolution $\delta t=10^{-4}$, measuring every $\Delta T=10$ steps after an initial equilibration time of $T_{0}=20\cdot10^{3}$ steps.
  • Figure 4: Hyperparameter transfer. Upper panel: Numerical result for the mean discrepancy $\langle\Delta\rangle$ (dots) obtained by Langevin sampling until equilibrium. Three systems are compared. System 1 was trained with $P=1000$ samples and ridge parameter $r_{0}=P/\kappa$ (blue). System 2 was trained with $P^{\prime}=500$ samples and ridge parameter $r_{0}^{\prime}=P^{\prime}/\kappa$ (green); systems 1 and 2 both use the same $\kappa\simeq3.98$ and $U_{0}=0.05$. System 3 (red) was trained with $P^{\prime}=500$ samples but with parameters $\tilde{r}$ and $\tilde{U}$ given by eq:hyperparameter_transform, so that it resides on the same RG trajectory as system 1. Prediction of the discrepancy $\sqrt{P}\,z(\ell)\,d(\Lambda,\ell)|_{\ell=\Lambda/k}$ from the renormalized theory (black) for system 1. Lower panel: Difference between numerical results (system 1 and system 3) and prediction from RG. Other parameters: $\alpha=0.2$, $\beta=0.3$. Numerical results obtained by sampling the learning dynamics for $T=50\cdot10^{6}$ steps ($T=55\cdot10^{6}$ for the red dots to suppress noise) with time resolution $\delta t=10^{-4}$, measuring every $\Delta T=10$ steps after an initial equilibration time of $T_{0}=20\cdot10^{3}$ steps.
  • Figure 5: Neural scaling law. Expected training loss $\langle\mathcal{L}\rangle/P$ per data sample of a system trained with power-law exponents $\alpha=0.2$ and $\beta=0.3$. Full lines describe the prediction from the discrete sum eq:loss_discrete obtained by solving the full flow equations eq:RG_r_main and eq:RG_U_main numerically for the Gaussian process ($U=0$, full blue curve) and the non-Gaussian process ($U=0.05$, full red curve). The dashed lines describe the analytical continuous approximation eq:L_bias_GP_main_continuous and eq:L_var_GP_main_continuous for the Gaussian process (dashed blue curve) and the continuous approximation eq:L_cont_main_non_GP for the non-Gaussian process (dashed red curve).
  • ...and 8 more figures
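
The captions of Figures 3 and 4 describe a sampling protocol: run Langevin dynamics with a fixed time step, discard an initial equilibration period, then record a measurement every few steps. The sketch below illustrates only that protocol. The quartic toy action $S(\Delta)=\tfrac{r}{2}\Delta^{2}+\tfrac{U}{4}\Delta^{4}$, the value $r=1$, the number of parallel chains, and the function name `langevin_sample` are assumptions made for illustration and are not taken from the paper; the step count and time step are also much coarser than the values quoted in the captions so that the example runs in seconds.

```python
import numpy as np


def langevin_sample(grad_S, x0, n_steps, dt, measure_every, n_equil, rng):
    """Overdamped Langevin dynamics: dx = -grad_S(x) dt + sqrt(2 dt) * noise.

    Measurements start after `n_equil` equilibration steps and are taken
    every `measure_every` steps, mirroring the protocol in the captions.
    """
    x = np.array(x0, dtype=float)
    samples = []
    for step in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Euler-Maruyama update of all chains at once.
        x = x - grad_S(x) * dt + np.sqrt(2.0 * dt) * noise
        if step >= n_equil and (step - n_equil) % measure_every == 0:
            samples.append(x.copy())
    return np.array(samples)


# Toy action (an assumption, NOT the paper's): S(x) = r/2 x^2 + U/4 x^4.
r = 1.0    # assumed quadratic coefficient
U = 0.05   # quartic coupling; the captions quote U = 0.05


def grad_S(x):
    return r * x + U * x**3


rng = np.random.default_rng(0)
samples = langevin_sample(
    grad_S,
    x0=np.zeros(256),   # 256 independent chains (assumed)
    n_steps=4000,       # captions use T = 50e6 steps
    dt=1e-2,            # captions use dt = 1e-4
    measure_every=10,   # Delta T = 10, as in the captions
    n_equil=1000,       # captions use T_0 = 20e3 steps
    rng=rng,
)

print("mean discrepancy:", samples.mean())
print("variance:", samples.var())
```

With $U=0$ the stationary distribution of this toy dynamics is Gaussian with variance $1/r$, so comparing the sampled variance for $U=0$ and $U>0$ gives a quick sanity check of the sampler before pointing it at a more realistic action.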