Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra

Roman Worschech, Bernd Rosenow

TL;DR

This work employs techniques from statistical mechanics to analyze one-pass stochastic gradient descent within a student-teacher framework, and finds that the length of the symmetric plateau depends on the number of distinct eigenvalues of the data covariance matrix and the number of hidden units.

Abstract

Neural scaling laws describe how the performance of deep neural networks scales with key factors such as training data size, model complexity, and training time, often following power-law behaviors over multiple orders of magnitude. Despite their empirical observation, the theoretical understanding of these scaling laws remains limited. In this work, we employ techniques from statistical mechanics to analyze one-pass stochastic gradient descent within a student-teacher framework, where both the student and teacher are two-layer neural networks. Our study primarily focuses on the generalization error and its behavior in response to data covariance matrices that exhibit power-law spectra. For linear activation functions, we derive analytical expressions for the generalization error, exploring different learning regimes and identifying conditions under which power-law scaling emerges. Additionally, we extend our analysis to non-linear activation functions in the feature learning regime, investigating how power-law spectra in the data covariance matrix impact learning dynamics. Importantly, we find that the length of the symmetric plateau depends on the number of distinct eigenvalues of the data covariance matrix and the number of hidden units, demonstrating how these plateaus behave under various configurations. In addition, our results reveal a transition from exponential to power-law convergence in the specialized phase when the data covariance matrix possesses a power-law spectrum. This work contributes to the theoretical understanding of neural scaling laws and provides insights into optimizing learning performance in practical scenarios involving complex data structures.
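The setup described in the abstract can be illustrated with a minimal sketch for the simplest case: a linear activation with a single hidden unit in both student and teacher. This is not the authors' code. The spectrum convention $\lambda_k \propto k^{-(1+\beta)}$, the normalization of the eigenvalues to sum to $N$, the $1/\sqrt{N}$ output scaling, and all function names are assumptions made for illustration.

```python
import numpy as np

def power_law_spectrum(n, beta):
    """Eigenvalues lambda_k proportional to k^{-(1+beta)} (assumed convention),
    rescaled so that they sum to n (assumed normalization)."""
    k = np.arange(1, n + 1, dtype=float)
    lam = k ** (-(1.0 + beta))
    return n * lam / lam.sum()

def generalization_error(J, B, lam):
    """eps_g = <(student - teacher)^2> / 2 for Gaussian inputs with a
    diagonal covariance whose eigenvalues are lam."""
    return 0.5 * np.sum(lam * (J - B) ** 2) / lam.size

def one_pass_sgd(n=64, beta=1.0, eta=0.1, sigma_J=0.01, alpha_max=50.0, seed=0):
    """One-pass SGD for a linear student learning a linear teacher.
    Returns alpha (examples seen per input dimension) and eps_g at each step."""
    rng = np.random.default_rng(seed)
    lam = power_law_spectrum(n, beta)
    B = rng.standard_normal(n)             # teacher vector
    J = sigma_J * rng.standard_normal(n)   # student vector, small initialization
    steps = int(alpha_max * n)
    eps = np.empty(steps + 1)
    eps[0] = generalization_error(J, B, lam)
    for t in range(steps):
        # A fresh example is drawn at every step and never reused (one-pass).
        x = np.sqrt(lam) * rng.standard_normal(n)
        # Teacher output minus student output, both scaled by 1/sqrt(n).
        delta = (B - J) @ x / np.sqrt(n)
        # Gradient step on the squared error of this single example.
        J += eta / np.sqrt(n) * delta * x
        eps[t + 1] = generalization_error(J, B, lam)
    alpha = np.arange(steps + 1) / n
    return alpha, eps

if __name__ == "__main__":
    alpha, eps = one_pass_sgd()
    print(alpha[-1], eps[0], eps[-1])
```

Plotting `eps` against `alpha` on log-log axes for several values of `beta` is one way to inspect how the decay changes with the spectrum. The hidden-unit plateaus discussed in the abstract require non-linear activations with more than one hidden unit and are not captured by this sketch.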

Paper Structure

This paper contains 28 sections, 129 equations, 20 figures.

Figures (20)

  • Figure 1: Generalization error $\epsilon_g$ as a function of $\alpha$ for $K = M = 2$, $\eta = 0.1$, $\beta = 1$, $\sigma_J = 0.01$, and $N = 1024$, with varying numbers of distinct eigenvalues $L$. As $L$ increases, the plateau length decreases until it disappears. Additionally, with increasing $L$, the convergence of the asymptotic generalization error slows down, transitioning from exponential to power-law scaling in the early asymptotic phase.
  • Figure 2: Generalization error $\epsilon_g$ for a linear activation function. Left: $\epsilon_g$ evaluated using Eq. (\ref{eps perceptron time}) (blue) and Eq. (\ref{eps perceptron time full}) (orange) for $N = 128$, $K = M = 1$, $\sigma_J^2 = 1$, $\beta = 1$, and $\eta = 1$. Right: $\epsilon_g$ evaluated using Eq. (\ref{linear time evolution integral}) (dashed orange) compared to simulations of a student-teacher setup averaged over 5 random initializations (solid blue), with $N = L = 256$, $\beta = 0.75$, $\eta = 0.1$, and $\sigma_J = 0.01$.
  • Figure 3: Generalization error $\epsilon_g$ for different trainable input dimensions $N_l$ of the student network. Left: $\epsilon_g$ as a function of $\alpha$ for various $N_l$, with $L = N = 256$, $K = M = 1$, $\sigma_J = 0.01$, $\eta = 0.05$, and $\beta = 1$. The student network is trained on synthetic data and the teacher's outputs. Right: $\epsilon_g$ as a function of $\alpha$, with $L = N = 1024$, $K = M = 1$, $\sigma_J = 0.01$, and $\eta = 0.05$. The student network is trained on the CIFAR-5m dataset (Nakkiran et al., 2021) using the teacher's outputs. We estimate the scaling exponent $\beta \approx 0.3$ for this dataset. For the theoretical predictions, the empirical data spectrum is used to evaluate Eq. (\ref{eg min Nl}). Both plots compare the simulation results (solid curves) to the theoretical prediction from Eq. (\ref{eg min Nl}) (black dashed lines). For both plots, the generalization error is averaged over 50 random initializations of the student and teacher vectors.
  • Figure 4: Symmetric plateau for a non-linear activation function. Left and center: Plateau behavior of the order parameters for $L = 10$, $N = 7000$, $\sigma_J = 0.01$, $\eta = 0.1$, and $M = K = 4$, using one random initialization of the student and teacher vectors. We solve the differential equations in the small learning rate regime, retaining terms up to $\mathcal{O}(\eta)$. The insets display the higher-order order parameters at the plateau. For the student-teacher order parameters, we observe $M$ distinct plateau heights, while the student-student order parameters exhibit a single plateau height with minor statistical deviations in the matrix entries $Q_{ij}^{(l)}$. The dashed horizontal lines in the insets correspond to the plateau heights predicted by Eq. (\ref{plateau fix 1}). Right: Corresponding generalization error $\epsilon_g$ for the same setup. The vertical dashed lines indicate the estimated plateau length based on Eqs. (\ref{Biehl plateau}) and (\ref{escape time}).
  • Figure 5: Plateau behavior of the generalization error $\epsilon_g$ obtained by simulations for one random initialization of student and teacher vectors with $K=M$, $\eta=0.01$, $\sigma_J=10^{-6}$ and $\beta=0.25$. Left: $\epsilon_g$ for different student-teacher sizes for $L=N=512$. Right: $\epsilon_g$ for different numbers of distinct eigenvalues $L$ and for $K=M=6$ and $N=500$.
  • ...and 15 more figures
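
The caption of Figure 3 mentions estimating a scaling exponent $\beta$ from an empirical data spectrum. One common way to do this, shown here only as a hedged sketch, is a least-squares line fit in log-log space. It assumes the convention $\lambda_k \propto k^{-(1+\beta)}$; the function names and the fitting window parameters `k_min` and `k_max` are hypothetical, and the authors' actual estimation procedure is not described in this section.

```python
import numpy as np

def empirical_spectrum(X):
    """Eigenvalues of the sample covariance of X (rows are examples),
    returned in descending order."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / X.shape[0]
    return np.linalg.eigvalsh(cov)[::-1]

def estimate_beta(eigenvalues, k_min=1, k_max=None):
    """Fit log(lambda_k) = c - (1 + beta) * log(k) over ranks k_min..k_max
    and return beta (assumed convention: lambda_k ~ k^{-(1+beta)})."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    k = np.arange(1, lam.size + 1)
    k_max = lam.size if k_max is None else k_max
    # Restrict to the requested rank window and to strictly positive eigenvalues.
    window = (k >= k_min) & (k <= k_max) & (lam > 0)
    slope, _ = np.polyfit(np.log(k[window]), np.log(lam[window]), 1)
    return -slope - 1.0
```

On real data the smallest eigenvalues are often dominated by finite-sample noise, so restricting the fit with `k_min` and `k_max` to the range where the spectrum looks linear on a log-log plot may give a more stable estimate.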