A Solvable Model of Neural Scaling Laws

Alexander Maloney, Daniel A. Roberts, James Sully

TL;DR

This work proposes a solvable theory for neural scaling laws by coupling a latent-space data-generating process with a nonlinear random-feature map and solving the resulting generalized linear regression problem using planar random-matrix theory. The analysis shows that power-law test-loss scaling emerges when the effective spectrum of the data representation follows a heavy-tailed, power-law form and when neither data nor parameter resources bottlenecks the other. Key insights include how nonlinear feature maps extend the spectral bulk, the critical role of the latent-space dimension M, and the equiparameterization regime as optimal under regularization; breakdowns occur when M is not the largest scale. The framework also clarifies limitations of the data model, connects to prior RMT approaches, and suggests directions for incorporating representation learning and more complex (quadratic) models to capture broader scaling phenomena.

Abstract

Large language models with a huge number of parameters, when trained on a near-internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down and how they behave when they do. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss and how the finite extent of the data's spectral power law causes the model's performance to plateau.
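
As a rough illustration of the bottleneck behavior described in the abstract, the sketch below evaluates a phenomenological loss of the form used by Kaplan et al., $\mathcal{L}(N, T) = [(N_c/N)^{\alpha_N/\alpha_T} + T_c/T]^{\alpha_T}$. The functional form, the function name `phenomenological_loss`, and every numerical constant here are assumptions chosen for illustration; none of the values are taken from the paper.

```python
import numpy as np

def phenomenological_loss(N, T, alpha_N=0.3, alpha_T=0.4, N_c=1.0, T_c=1.0):
    """Assumed Kaplan-style ansatz: L = [(N_c/N)^(alpha_N/alpha_T) + T_c/T]^alpha_T.

    All default constants are arbitrary illustrative values.
    """
    N = np.asarray(N, dtype=float)
    T = np.asarray(T, dtype=float)
    return ((N_c / N) ** (alpha_N / alpha_T) + T_c / T) ** alpha_T

def loglog_slope(x, y):
    """Least-squares slope of log(y) against log(x)."""
    return np.polyfit(np.log(x), np.log(y), 1)[0]

# Parameters not limiting (N very large): the loss falls as T^(-alpha_T).
T_grid = np.logspace(2, 6, 20)
slope_T = loglog_slope(T_grid, phenomenological_loss(1e40, T_grid))

# Data not limiting (T very large): the loss plateaus at (N_c/N)^alpha_N.
plateau = float(phenomenological_loss(1e3, 1e40))

print(f"slope in T: {slope_T:.3f}, plateau at N=1e3: {plateau:.4f}")
```

In this ansatz the two terms inside the bracket are comparable when $N \sim T^{\alpha_T/\alpha_N}$, which is the joint scaling at which neither resource bottlenecks the other.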

Paper Structure

This paper contains 24 sections, 231 equations, and 19 figures.

Figures (19)

  • Figure 1: Cartoon plot of the empirical scaling laws discovered by Ref. \cite{kaplan2020scaling}, demonstrating that the test loss of LLMs trained with early stopping is predictably described by a simple phenomenological model, \ref{eq:phenomenological-loss-original}, plotted as a function of dataset size, $T$, for different model sizes, $N = \{N_0, N_0^2,N_0^3,N_0^4 \}$: if the model isn't bottlenecked by the number of parameters ($N \to \infty$), the test loss behaves as a power law in the training set size, $\mathcal{L}(N, T) \sim T^{-\alpha_T}$; otherwise, if the number of parameters is too small for a given training set, then the test loss stalls at a plateau whose value depends predictably on the number of parameters, $\mathcal{L}(N, T) \sim N^{-\alpha_N}$. Similar statements hold with the roles of the training set and parameter resources reversed, and scaling both training set and parameters jointly with relative ratio $N \sim T^{\alpha_{T}/\alpha_{N}}$ ensures the overall best performance.
  • Figure 2: Log-log plot of example spectra for different dataset sizes, $T$, from different data domains. Increasing the dataset size, $T$, increases the extent of the approximate power-law fit (dashed line) so long as $T < N_{\text{in}}$. Left: CIFAR-10 \cite{cifar-dataset}, a CV dataset of $32 \times 32$-pixel natural color images. The $3$ color channels bring the total number of input features per image to $N_{\text{in}} = 3 \times 32 \times 32= 3072$. Right: WikiText, an NLP dataset taken from the verified Good and Featured articles on Wikipedia \cite{merity2016pointer}. The input data was tokenized and then embedded using Hugging Face's implementation \cite{wolf2019huggingface} of GPT-2 \cite{radford2019language}, and the embedding we use has dimension $N_{\text{in}} = 768$.
  • Figure 3: Example spectra from different data domains for a fixed dataset size and subsampled input features. (For a more detailed description of the datasets, see the caption of Fig. \ref{fig:spectrums}.) Increasing the number of input features in the subsample extends the length of the approximate power-law fit for the bulk (dashed line). Left: CIFAR-10, with pixels subsampled from the total $3072$ input features and a dataset size of $T=3072$. Right: WikiText, with the components of the embedding subsampled from the total $768$-dimensional embedding vector for each token and a dataset size of $T=768$.
  • Figure 4: Spectra of the feature representation from CIFAR-10 ($N_{\text{in}} = 3072$) of a fixed dataset ($T = 15000$), with an approximate power-law fit for the bulk (dashed line). Left: A linear map, \ref{eq:linear-feature-map}, does not extend the length of the approximate power-law fit. Right: For a nonlinear map, a ReLU activation applied after a linear map, \ref{eq:simplest-feature-map}, increasing the number of features, $N$, increases the extent of the approximate power-law fit. This extension is limited by the dataset size, $T$.
  • Figure 5: Spectrum $\lambda_I$ from numerical simulations (stars) of our latent data generative model, \ref{eq:feature-feature-covariance-definition} and \ref{eq:exact-power-law-latent-generative}, with the maximum eigenvalue fixed ($\lambda_+ =1$). Left: The size of the dataset, $T$, is varied while the size of the latent space and the power-law exponent are fixed ($M = 1000$, $\alpha =1$). These spectra follow a pattern similar to the ones displayed in Fig. \ref{fig:spectrums} for natural data: for dataset size smaller than the size of the latent space ($T <M$, blue and orange) the spectrum has a bulk power-law portion that terminates in a very rapid decline ($\lambda_I \to 0$) as the index approaches the size of the dataset ($I \to T$), and the extent of the power law increases with increasing dataset size; for dataset size equal to and greater than the size of the latent space ($T \geq M$, green and red), the power law terminates at the size of the latent space, but the rapid decline becomes sharper and sharper as the size of the dataset increases, forming a kink in the limit of infinite data ($T \to \infty$, dashed black line). Right: The power-law exponent, $\alpha$, is varied as the sizes of the dataset and latent space are held fixed ($T=1000$, $M=2000$), and the spectrum for infinite data is plotted for comparison (dashed lines). As all three simulations have the same size datasets, their power laws all terminate at the same point ($T=1000$).
  • ...and 14 more figures
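
The qualitative behavior described in the captions of Figures 4 and 5 can be explored with a small simulation. The sketch below is a toy version under stated assumptions, not the paper's construction: it assumes zero-mean Gaussian latent samples whose covariance eigenvalues follow $\lambda_I = \lambda_+ I^{-(1+\alpha)}$, it looks at the empirical covariance of the latent samples directly, and it uses Gaussian random weights with a $1/\sqrt{M}$ scaling for the feature maps. The helper names `latent_samples` and `empirical_spectrum` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_samples(M, T, alpha=1.0, lambda_plus=1.0):
    """Draw T zero-mean Gaussian samples in an M-dimensional latent space.

    Assumed covariance spectrum: lambda_I = lambda_plus * I**-(1 + alpha).
    """
    index = np.arange(1, M + 1, dtype=float)
    lam = lambda_plus * index ** -(1.0 + alpha)
    return np.sqrt(lam)[:, None] * rng.standard_normal((M, T))  # shape (M, T)

def empirical_spectrum(X):
    """The min(M, T) leading eigenvalues of (1/T) X X^T, in descending order."""
    T = X.shape[1]
    return np.linalg.svd(X, compute_uv=False) ** 2 / T

# The empirical spectrum has at most min(M, T) nonzero eigenvalues, so the
# power law is cut off by the dataset size when T < M and by M otherwise.
spec_small = empirical_spectrum(latent_samples(M=50, T=30))  # T < M
spec_large = empirical_spectrum(latent_samples(M=50, T=80))  # T >= M

# Linear versus ReLU random features built on the same inputs.
M, T, N = 20, 400, 100
X = latent_samples(M, T)
W = rng.standard_normal((N, M)) / np.sqrt(M)  # weight scaling is an assumption
rank_linear = int(np.linalg.matrix_rank(W @ X))                  # bounded by M
rank_relu = int(np.linalg.matrix_rank(np.maximum(W @ X, 0.0)))   # can exceed M

print(len(spec_small), len(spec_large), rank_linear, rank_relu)
```

The rank comparison is only a proxy for the extent of the spectrum: a linear map cannot produce more nonzero eigenvalues than its input has dimensions, while the ReLU features are not confined to that subspace.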