A Dynamical Model of Neural Scaling Laws

Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan

TL;DR

This work presents a tractable dynamical model that reproduces key neural scaling-law phenomena by analyzing a random-feature teacher-student setup with gradient-flow training through dynamical mean-field theory (DMFT). It shows how test loss scales as a power law in time, model size, and data, and derives an asymmetric compute-optimal strategy that increases training steps faster than parameter count. The framework also explains early universal 1/width convergence, late-time task-dependent exponents, the gradual buildup of overfitting under data reuse, and why ensembling is not always compute-optimal, with validation on realistic CIFAR-5M and Wikitext-like tasks. Overall, the DMFT approach links spectral properties of the data distribution to dynamic learning curves, offering a principled lens on compute-efficient training and the role of feature learning in accelerating scaling.

Abstract

On a variety of tasks, the performance of neural networks predictably improves with training time, dataset size and model size across many orders of magnitude. This phenomenon is known as a neural scaling law. Of fundamental importance is the compute-optimal scaling law, which reports the performance as a function of units of compute when choosing model sizes optimally. We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. This reproduces many observations about neural scaling laws. First, our model makes a prediction about why the scalings of performance with training time and with model size have different power-law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule where the number of training steps is increased faster than the number of model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate $1/\textit{width}$ but at late time exhibit a rate $\textit{width}^{-c}$, where $c$ depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
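
To make the asymmetric rule concrete, here is a short worked calculation. It is only a sketch: it assumes the additive fit form reported in the caption of Figure 1(a), and it assumes that compute is the product $C = N t$ of model size and training steps, which is not stated in this section. Here $\alpha$ and $\beta$ denote the time and model-size exponents of that fit.

$$
\mathcal{L}(t, N) \approx t^{-\alpha} + N^{-\beta}, \qquad C = N t .
$$

Substituting $t = C/N$ and minimizing over $N$ at fixed $C$:

$$
\mathcal{L}(N; C) = C^{-\alpha} N^{\alpha} + N^{-\beta}, \qquad \frac{\partial \mathcal{L}}{\partial N} = \alpha\, C^{-\alpha} N^{\alpha - 1} - \beta\, N^{-\beta - 1} = 0 .
$$

$$
N_\star \propto C^{\frac{\alpha}{\alpha + \beta}}, \qquad t_\star = \frac{C}{N_\star} \propto C^{\frac{\beta}{\alpha + \beta}}, \qquad \mathcal{L}_\star \propto C^{-\frac{\alpha \beta}{\alpha + \beta}} .
$$

Under these assumptions, training steps grow faster than model size with compute exactly when $\beta > \alpha$, that is, when the loss falls off more steeply in model size than in time.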

Paper Structure (73 sections, 157 equations, 12 figures)

Figures (12)

  • Figure 1: Train and test losses (cross-entropy) as a function of training time $t$ and width $N$. For models trained online, we do not make a distinction between training and test error because each new batch is drawn fresh and would have the same loss in expectation as an independent test set. (a) The test loss of a residual CNN on CIFAR-5M is well described by a fit of the form $\mathcal{L} \sim t^{-\alpha} + N^{-\beta}$ in the online training regime. (b) The compute-optimal strategy requires scaling up both model size and training time simultaneously. (c) Transformer training on Wikitext with 100M tokens before data repetition. Model performance is monotonic in width $N$. (d) Wikitext with 5M subsampled tokens. Larger width $N$ is not always better, as wider models can overfit.
  • Figure 2: Verification of the various bottleneck scalings for power-law features with $a = 1.5$ and $b=1.25$. Dashed black lines are DMFT solutions while colors are simulations with standard deviation highlighted. (a) The loss dynamics at large $\alpha$ will be bottlenecked by either time or finite $\nu$. (b) Early in training, the loss converges to its limit as $N^{-1}$ (App. \ref{app:early_time}). (c) At long times, the model's asymptotic loss scales as $N^{-(a-1)}$ (App. \ref{app:final_value_dmft}). (d)-(f) The same results but with $N$ and $P$ switched. The model exhibits $1/P$ corrections at early time and power-law data bottleneck scalings at late time.
  • Figure 3: Our DMFT can also capture online SGD learning, including the effect of batch size fluctuations on the loss and the finite $N$ bottleneck. (a) Power-law features trained with SGD and a fixed random projection still generate asymptotes which depend on $N$. (b) The batch size $B$ impacts the loss through additional variance in the dynamics but does not lead to an asymptotic plateau.
  • Figure 4: Compute-optimal scaling in our model is determined by the tradeoff between time and model-size bottlenecks. Solid colored lines are simulations with power-law features, and the dashed black line is the theoretical prediction of compute-optimal scaling. Each color represents a different model size with $N \in [2^5, 2^{10}]$. The Pareto frontier is defined as the minimum value of $L$ at each compute $C$ over all possible choices of model size $N$. Although the final losses do not depend on the spectral decay rate $b$ but only on the task-power exponent $a$, the compute-optimal scaling does depend on $b$.
  • Figure 5: In a data-limited regime, wider networks train faster, but generalization cannot be improved indefinitely by making $N$ larger. (a) Test loss for power-law features with $a=1.5$ and $b = 1.25$ with $P=128$ and varying $N$. In this regime, there are diminishing returns to making the model size larger. (b) For $N < P$, the model is underparameterized and cannot achieve zero train loss. For $N > P$, the train loss will eventually decay at an exponential rate which depends on $N$, despite the test loss saturating. (c) The train and test losses gradually separate at a rate which depends on $P$.
  • ...and 7 more figures
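
The Pareto frontier described in the Figure 4 caption (the minimum loss at each compute budget over all model sizes) can be illustrated numerically. The sketch below is not the paper's simulation: it assumes the additive fit form $\mathcal{L} \sim t^{-\alpha} + N^{-\beta}$ from Figure 1(a), assumes compute is $C = N t$, and uses hypothetical exponents and grids chosen only for illustration. The function names `loss` and `pareto_frontier` are likewise made up for this example.

```python
import numpy as np

def loss(t, n, alpha, beta):
    """Assumed additive fit form L(t, N) = t^-alpha + N^-beta."""
    return t ** (-alpha) + n ** (-beta)

def pareto_frontier(compute, widths, alpha, beta):
    """Minimum loss at each compute budget over the candidate widths.

    Assumes compute C = N * t, so a model of width N gets t = C / N steps.
    Returns the frontier losses and the width that attains each of them.
    """
    c = np.asarray(compute, dtype=float)[:, None]   # shape (n_budgets, 1)
    n = np.asarray(widths, dtype=float)[None, :]    # shape (1, n_widths)
    losses = loss(c / n, n, alpha, beta)            # shape (n_budgets, n_widths)
    best = np.argmin(losses, axis=1)
    return losses[np.arange(len(best)), best], n[0, best]

# Hypothetical exponents with beta > alpha (not values from the paper).
alpha, beta = 0.5, 1.0
compute = np.logspace(6, 12, 25)
widths = np.logspace(0, 8, 801)   # dense illustrative grid of model sizes

frontier, best_n = pareto_frontier(compute, widths, alpha, beta)

# Log-log slopes: how the frontier loss and the optimal width scale with C.
slope_loss = np.polyfit(np.log(compute), np.log(frontier), 1)[0]
slope_width = np.polyfit(np.log(compute), np.log(best_n), 1)[0]
slope_steps = 1.0 - slope_width   # because t = C / N
print(slope_loss, slope_width, slope_steps)
```

Under these assumptions the fitted slopes should land close to $-\alpha\beta/(\alpha+\beta)$ for the loss and $\alpha/(\alpha+\beta)$ for the optimal width, so with $\beta > \alpha$ the number of steps grows faster with compute than the width does.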