A Dynamical Model of Neural Scaling Laws

Blake Bordelon; Alexander Atanasov; Cengiz Pehlevan

TL;DR

This work presents a tractable dynamical model that reproduces key neural scaling-law phenomena by analyzing a random-feature teacher-student setup with gradient-flow training through dynamical mean-field theory. It reveals how test loss scales as a power-law in time, model size, and data, and derives an asymmetric compute-optimal strategy that increases training steps faster than parameter count. The framework also explains early universal 1/width convergence, late-time task-dependent exponents, data-reuse induced overfitting buildup, and why ensembling is not always compute-optimal, with validation on realistic CIFAR-5M and Wikitext-like tasks. Overall, the DMFT approach links spectral properties of the data distribution to dynamic learning curves, offering a principled lens on compute-efficient training and the role of feature learning in accelerating scaling.

Abstract

On a variety of tasks, the performance of neural networks predictably improves with training time, dataset size and model size across many orders of magnitude. This phenomenon is known as a neural scaling law. Of fundamental importance is the compute-optimal scaling law, which reports the performance as a function of units of compute when choosing model sizes optimally. We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. This reproduces many observations about neural scaling laws. First, our model makes a prediction about why performance scales with training time and with model size with different power-law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule where the number of training steps is increased faster than the number of model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate $1/\textit{width}$ but at late times exhibit a rate $\textit{width}^{-c}$, where $c$ depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
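
The asymmetric compute-optimal rule can be illustrated with a short worked derivation. This is only a sketch under stated assumptions: it takes the additive fit form $\mathcal{L} \sim t^{-\alpha} + N^{-\beta}$ quoted in the Figure 1 caption at face value, drops all prefactors, and assumes that compute is the product $C = tN$ of training time and model size. The symbols $N_\star$, $t_\star$ and $\mathcal{L}_\star$ are introduced here for illustration.

```latex
\begin{align}
\mathcal{L}(t, N) &= t^{-\alpha} + N^{-\beta},
  \qquad C = t N \quad \text{(assumed compute budget)} \\
\mathcal{L}(C/N, N) &= C^{-\alpha} N^{\alpha} + N^{-\beta} \\
\frac{\partial \mathcal{L}}{\partial N}
  &= \alpha\, C^{-\alpha} N^{\alpha - 1} - \beta\, N^{-\beta - 1} = 0
  \;\Longrightarrow\;
  N_\star^{\alpha + \beta} = \frac{\beta}{\alpha}\, C^{\alpha} \\
N_\star &\propto C^{\frac{\alpha}{\alpha + \beta}},
  \qquad
  t_\star = \frac{C}{N_\star} \propto C^{\frac{\beta}{\alpha + \beta}},
  \qquad
  \mathcal{L}_\star \propto C^{-\frac{\alpha \beta}{\alpha + \beta}}
\end{align}
```

Under these assumptions the two exponents coincide only when $\alpha = \beta$. Whenever the time exponent is smaller than the model-size exponent ($\alpha < \beta$), the optimal training time $t_\star$ grows faster with compute than the optimal model size $N_\star$, which is the kind of asymmetry the abstract describes.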

Paper Structure (73 sections, 157 equations, 12 figures)

Introduction
Test Loss Scales as a Power-law in Training Time and Model Size and Compute.
Compute-Optimal Training Time and Model Size Scaling Exponents Are Different.
Larger Models Train Faster.
Models Accumulate Finite-Dataset and Finite-Width Corrections.
Scaling Exponents are Task-Dependent at Late Training Time, but not at Early Time.
Ensembling is Not the Same as Going Wider.
Related Works
Setup of the Model
Teacher Model.
Student Model.
Training.
DMFT for Scaling Laws
Results
Test loss power laws.
...and 58 more sections

Figures (12)

Figure 1: Train and test losses (cross-entropy) as a function of training time $t$ and width $N$. For models trained online, we do not make a distinction between training and test error because each new batch is drawn fresh and would have the same loss in expectation as an independent test set. (a) The test loss of a residual CNN on CIFAR-5M is well described by a fit of the form $\mathcal{L} \sim t^{-\alpha} + N^{-\beta}$ in the online training regime. (b) The compute optimal strategy requires scaling up both model size and training time simultaneously. (c) Transformer training on wikitext with 100M tokens before data-repetition. Model performance is monotonic in width $N$. (d) Wikitext with 5M subsampled tokens. Larger width $N$ is not always better as wider models can overfit.
Figure 2: Verification of the various bottleneck scalings for power-law features with $a = 1.5$ and $b=1.25$. Dashed black lines are DMFT solutions while colors are simulations with standard deviation highlighted. (a) The loss dynamics at large $\alpha$ will be bottlenecked by either time or finite $\nu$. (b) Early in training, the loss converges to its limit as $N^{-1}$ (App. \ref{app:early_time}). (c) At long times, the model's asymptotic loss scales as $N^{-(a-1)}$ (App. \ref{app:final_value_dmft}). (d)-(f) The same results but for $N$ and $P$ switched. The model exhibits $1/P$ corrections at early time and power-law data bottleneck scalings at late time.
Figure 3: Our DMFT can also capture online SGD learning, including the effect of batch size fluctuations on the loss and the finite $N$ bottleneck. (a) Power-law features trained with SGD and a fixed random projection still generate asymptotes which depend on $N$. (b) The batch size $B$ impacts the loss through additional variance in the dynamics but does not lead to an asymptotic plateau.
Figure 4: Compute optimal scaling in our model is determined by a tradeoff between time and model-size bottlenecks. Solid colored lines are simulations with power-law features and the dashed black line is the theoretical prediction of compute optimal scaling. Each color represents a different model size with $N \in [2^5, 2^{10}]$. The Pareto frontier is defined as the minimum value of $L$ at each compute $C$ over all possible choices of model size $N$. Although the final losses do not depend on the spectral decay rate $b$ but only on the task-power exponent $a$, the compute optimal scaling does depend on $b$.
Figure 5: In a data-limited regime, wider networks train faster but cannot indefinitely improve generalization by making $N$ larger. (a) Test loss for power-law features with $a=1.5$ and $b = 1.25$ with $P=128$ and varying $N$. In this regime, there are diminishing returns to making the model size larger. (b) For $N < P$, the model is underparameterized and cannot achieve zero train loss. For $N > P$, the train loss will eventually decay at an exponential rate which depends on $N$, despite the test loss saturating. (c) The train and test losses gradually separate at a rate which depends on $P$.
...and 7 more figures
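
The Pareto frontier described in the Figure 4 caption (the minimum loss at each compute $C$ over all model sizes $N$) can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: it assumes the additive fit form $t^{-\alpha} + N^{-\beta}$ from the Figure 1 caption with unit prefactors, assumes compute is $C = tN$, and uses hypothetical exponents `alpha = 0.4` and `beta = 0.5`. The function names `loss` and `pareto_frontier` are likewise made up for this example. Only the width grid $N \in [2^5, 2^{10}]$ is taken from the caption.

```python
import numpy as np


def loss(t, N, alpha, beta):
    """Assumed additive fit form: a time bottleneck plus a model-size bottleneck."""
    return t ** (-alpha) + N ** (-beta)


def pareto_frontier(computes, widths, alpha, beta):
    """Minimum loss at each compute budget over the available widths.

    Assumes compute C = t * N, so a width N trained with budget C runs for t = C / N.
    Returns the frontier losses and the width that attains each of them.
    """
    C = computes[:, None]          # shape (num_budgets, 1)
    N = widths[None, :]            # shape (1, num_widths)
    L = loss(C / N, N, alpha, beta)
    best = np.argmin(L, axis=1)    # index of the best width for each budget
    return L[np.arange(len(computes)), best], widths[best]


# Hypothetical exponents, chosen only for illustration.
alpha, beta = 0.4, 0.5
widths = 2.0 ** np.arange(5, 11)       # N in [2^5, 2^10], as in the Figure 4 caption
computes = np.logspace(4, 8, 9)        # hypothetical compute budgets

frontier_loss, best_width = pareto_frontier(computes, widths, alpha, beta)
for C, L, N in zip(computes, frontier_loss, best_width):
    print(f"C = {C:9.3g}   best N = {int(N):5d}   frontier loss = {L:.4f}")
```

With these assumptions the frontier loss decreases as compute grows and the best width never decreases, so following the frontier means scaling model size and training time together until the largest width on the grid is reached.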
