Deriving Neural Scaling Laws from the statistics of natural language

Francesco Cagnetta, Allan Raventós, Surya Ganguli, Matthieu Wyart

TL;DR

The paper derives a parameter-free theory linking data-limited neural scaling exponents to intrinsic statistics of natural language. By decomposing the autoregressive loss into a data-dependent prediction horizon and within-horizon learning, and assuming power-law decays for next-token entropy ($H_n-H_\infty \sim n^{-\gamma}$) and token-token correlations ($\|C(n)\|_{\mathrm{op}} \sim n^{-\beta}$), it predicts the data-limited exponent $\alpha_D = \gamma/(2\beta)$ and a data-collapse master curve for $n$-gram losses. Empirical validation on TinyStories and WikiText shows consistent measurements of $\gamma$ and $\beta$ across model classes, with the observed autoregressive scaling aligning with the predicted exponent across architectures and datasets. The results imply a fundamental limit imposed by language statistics on data-limited learning and propose a universality class of horizon-limited learning for deep models, providing a practical way to forecast scaling behavior from corpus statistics.

Abstract

Despite the fact that experimental neural scaling laws have substantially guided empirical progress in large-scale machine learning, no existing theory can quantitatively predict the exponents of these important laws for any modern LLM trained on any natural language dataset. We provide the first such theory in the case of data-limited scaling laws. We isolate two key statistical properties of language that alone can predict neural scaling exponents: (i) the decay of pairwise token correlations with time separation between token pairs, and (ii) the decay of the next-token conditional entropy with the length of the conditioning context. We further derive a simple formula in terms of these statistics that predicts data-limited neural scaling exponents from first principles without any free parameters or synthetic data models. Our theory exhibits a remarkable match with experimentally measured neural scaling laws obtained from training GPT-2 and LLaMA style models from scratch on two qualitatively different benchmarks, TinyStories and WikiText.
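The formula $\alpha_D = \gamma/(2\beta)$ stated in the TL;DR can be illustrated with a minimal sketch. It assumes that $\gamma$ and $\beta$ are obtained by ordinary least-squares fits in log-log space, which is one common way to estimate a power-law exponent and not necessarily the authors' fitting procedure. The function names and the synthetic curves are hypothetical placeholders, not measurements from the paper.

```python
import math

def fit_power_law_exponent(xs, ys):
    """Return the decay exponent e of y ~ x**(-e) via a least-squares line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    return -cov / var  # negate the slope so that a decaying curve gives a positive exponent

def predicted_data_exponent(gamma, beta):
    """Data-limited scaling exponent alpha_D = gamma / (2 * beta)."""
    return gamma / (2.0 * beta)

# Synthetic, illustrative curves only (assumed exponents, not values from the paper).
lags = list(range(1, 65))
excess_entropy = [2.0 * n ** -0.5 for n in lags]  # stands in for H_n - H_infinity
corr_norm = [0.3 * n ** -1.0 for n in lags]       # stands in for ||C(n)||_op

gamma_hat = fit_power_law_exponent(lags, excess_entropy)
beta_hat = fit_power_law_exponent(lags, corr_norm)
alpha_D = predicted_data_exponent(gamma_hat, beta_hat)
print(f"gamma={gamma_hat:.3f}, beta={beta_hat:.3f}, alpha_D={alpha_D:.3f}")
```

On real corpora the power law may hold only over a limited range of $n$, so the fit window would need to be chosen by inspecting the curves rather than using every lag.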

Paper Structure (22 sections, 41 equations, 12 figures)

Figures (12)

  • Figure 1: Measurable language statistics predict the exponents of data-limited neural scaling laws in language models. Top: The highly diverse $n$-gram losses $\mathcal{L}_n$ (\ref{eq:ngram-loss}) of a GPT-2–style transformer trained from scratch on $P$ tokens of the TinyStories dataset (left) collapse onto a single curve when plotted in rescaled units (right). Here $\mathcal{L}_n$ is rescaled by $H_n \asymp n^{-\gamma}$, where $H_n$ is the next-token conditional entropy conditioned on the previous $n$ tokens and $\gamma$ is the exponent of its power-law decay with $n$. Likewise, $P$ is rescaled by $n^{2\beta}$, where $\beta$ is the exponent governing the power-law decay of correlations between tokens separated by temporal lag $n$. The entropy exponent $\gamma$ and the correlation exponent $\beta$ are strictly properties of the dataset, yet they completely control all neural $n$-gram learning curves $\mathcal{L}_n(P)$ through this collapse. Bottom: The autoregressive test loss $\mathcal{L}$ (which averages the $n$-gram losses over all $n$) as a function of $P$ for models trained with varying context size $T$. Our theory predicts that the exponent of the neural scaling law depends on language statistics alone via $\gamma$ and $\beta$ and is given by $\alpha_D = \gamma / (2\beta)$. The predicted slope (dashed black line) matches that of the experimental neural scaling laws (colored lines), with the agreement improving at larger context sizes $T$, as the theory anticipates.
  • Figure 2: Conditional entropy decay with time horizon defines a characteristic exponent $\gamma$ that is architecture-independent. We train three classes of models from scratch on $P$-token slices of the TinyStories dataset: GPT-2–style transformers with absolute positional embeddings (Left), GPT-2–style transformers with rotary positional embeddings (RoPE, Center Left), and LLaMA-style transformers (Center Right), training a separate model for each $P$. For each model class, we measure the $n$-gram loss $\mathcal{L}_n$ as a function of $n$, with curves colored by $P$. We define $\gamma$ by fitting a power law to the initial decay of $\mathcal{L}_n$ for the model trained on the largest $P$. As $P$ increases, the small-$n$ region of $\mathcal{L}_n$ starts to converge, and the fitted exponent stabilizes. Crucially, the resulting values of $\gamma$ are consistent across architectures (Right). Thus, for a given dataset, $\gamma$ is a property of the data distribution and can be estimated from a single sufficiently large, well-trained model.
  • Figure 3: Decay of two-point correlations as a function of temporal separation defines a characteristic exponent $\beta$. The second dataset-level statistic we consider is the decay, with the time separation $n$, of the two-point correlation function, defined as the norm of the token-token co-occurrence matrix $C(n)_{\mu\nu} = \mathbb{P}(X_i=\mu, X_{i+n}=\nu) - \mathbb{P}(X_i=\mu)\mathbb{P}(X_{i+n}=\nu)$. We define $\beta$ as the exponent of a power-law fit to this decay. $C(n)$ is estimated from empirical token co-occurrence counts on the full training set for TinyStories (Left) and WikiText (Right). We plot both the Frobenius and operator norms, which closely track each other, and use the operator norm to characterize the decay. For TinyStories, the power law holds over a broad range of time separations, while for WikiText we fit $\beta$ using the initial decay regime.
  • Figure 4: Same as Figure 1, but for GPT-2–style transformers trained on WikiText. As in Figure 1, the top row is for $T=128$; see \ref{fig:wikitext-gpt2-T-512} for $T=512$. Corresponding figures for GPT-2–style transformers with RoPE on WikiText are in \ref{fig:wikitext-gpt2-rope} and \ref{fig:wikitext-gpt2-rope-T-512}.
  • Figure 5: Fitting $\gamma$ for the WikiText dataset. (Similar to Figure 2, but for WikiText.) We train two classes of models from scratch on $P$-token slices of the WikiText dataset: GPT-2–style transformers with absolute positional embeddings (APE, Left) and GPT-2–style transformers with RoPE (Center). For each model class, $\gamma$ is obtained by fitting a power law to the initial decay of $\mathcal{L}_n$ for the model trained with the largest $P$. As in Figure 2, the resulting estimates of $\gamma$ are consistent across architectures (Right).
  • ...and 7 more figures
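
The correlation statistic defined in the Figure 3 caption, $C(n)_{\mu\nu} = \mathbb{P}(X_i=\mu, X_{i+n}=\nu) - \mathbb{P}(X_i=\mu)\mathbb{P}(X_{i+n}=\nu)$, can be estimated from co-occurrence counts roughly as sketched below. This is a minimal illustration that assumes NumPy, integer token ids, a dense matrix (so a small vocabulary), and marginals taken from the same pair counts; the function names are hypothetical and the toy sequence is not data from the paper.

```python
import numpy as np

def correlation_matrix(tokens, lag, vocab_size):
    """Empirical C(n): joint probability of (X_i, X_{i+lag}) minus the product of its marginals."""
    tokens = np.asarray(tokens)
    first, second = tokens[:-lag], tokens[lag:]
    joint = np.zeros((vocab_size, vocab_size))
    np.add.at(joint, (first, second), 1.0)  # accumulate co-occurrence counts
    joint /= joint.sum()                    # normalize counts to probabilities
    p_first = joint.sum(axis=1)             # marginal of X_i
    p_second = joint.sum(axis=0)            # marginal of X_{i+lag}
    return joint - np.outer(p_first, p_second)

def correlation_norms(tokens, lag, vocab_size):
    """Return (operator norm, Frobenius norm) of the empirical C(lag)."""
    c = correlation_matrix(tokens, lag, vocab_size)
    return np.linalg.norm(c, 2), np.linalg.norm(c, "fro")

# Toy sequence for illustration only: strictly alternating tokens 0, 1, 0, 1, ...
tokens = np.tile([0, 1], 500)
c1 = correlation_matrix(tokens, lag=1, vocab_size=2)
op_norm, fro_norm = correlation_norms(tokens, lag=1, vocab_size=2)
print(op_norm, fro_norm)
```

Evaluating `correlation_norms` over a range of lags and fitting a power law to the operator norm would give an estimate of $\beta$. For realistic vocabulary sizes a dense `vocab_size × vocab_size` array may not fit in memory, so a sparse representation would likely be needed.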