A Dynamical Model of Neural Scaling Laws
Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan
TL;DR
This work presents a tractable dynamical model that reproduces key neural scaling-law phenomena by analyzing a random-feature teacher-student setup trained with gradient flow, using dynamical mean-field theory (DMFT). It shows how test loss scales as a power law in training time, model size, and dataset size, and derives an asymmetric compute-optimal strategy in which the number of training steps grows faster than the parameter count. The framework also explains the universal 1/width convergence seen early in training, the task-dependent exponents seen at late times, the gradual buildup of overfitting caused by data reuse, and why ensembling is not always compute-optimal, with validation on realistic CIFAR-5M and Wikitext-like tasks. Overall, the DMFT approach links spectral properties of the data distribution to the dynamics of learning curves, offering a principled lens on compute-efficient training and on the role of feature learning in accelerating scaling.
Abstract
On a variety of tasks, the performance of neural networks predictably improves with training time, dataset size and model size across many orders of magnitude. This phenomenon is known as a neural scaling law. Of fundamental importance is the compute-optimal scaling law, which reports the performance as a function of units of compute when choosing model sizes optimally. We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. This reproduces many observations about neural scaling laws. First, our model predicts why performance scales with training time and with model size with different power-law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule where the number of training steps is increased faster than the number of model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate $1/\textit{width}$ but at late times exhibit a rate $\textit{width}^{-c}$, where $c$ depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
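
To make the asymmetric compute-optimal rule concrete, here is a minimal numerical sketch. It is not the model analyzed in the paper: it assumes, purely for illustration, an additive loss of the form a * t^(-r_t) + b * N^(-r_N) in training steps t and parameter count N, a compute budget C = t * N, and hypothetical exponents r_t = 0.5 and r_N = 1.0. Under those assumptions, minimizing the loss at fixed C gives t* ∝ C^(r_N / (r_t + r_N)) and N* ∝ C^(r_t / (r_t + r_N)), so whenever the time exponent is the smaller of the two, training steps should grow faster than model size.

```python
import numpy as np

def loss(t, N, r_t, r_N, a=1.0, b=1.0):
    """Assumed additive power-law loss (illustrative, not the paper's expression)."""
    return a * t ** (-r_t) + b * N ** (-r_N)

def compute_optimal_split(C, r_t, r_N, a=1.0, b=1.0):
    """Closed-form minimizer of loss(t, N) subject to the assumed budget t * N = C."""
    # Setting d/dN [a * (C / N)**(-r_t) + b * N**(-r_N)] = 0 gives
    # N**(r_t + r_N) = (b * r_N) / (a * r_t) * C**r_t.
    N_star = ((b * r_N) / (a * r_t) * C ** r_t) ** (1.0 / (r_t + r_N))
    t_star = C / N_star
    return t_star, N_star

# Hypothetical exponents: loss falls more slowly in time than in model size.
r_t, r_N = 0.5, 1.0

budgets = np.logspace(4, 12, 9)
splits = np.array([compute_optimal_split(C, r_t, r_N) for C in budgets])
t_opt, N_opt = splits[:, 0], splits[:, 1]
L_opt = loss(t_opt, N_opt, r_t, r_N)

# Fit log-log slopes to read off how each quantity scales with compute.
log_C = np.log(budgets)
t_slope = np.polyfit(log_C, np.log(t_opt), 1)[0]
N_slope = np.polyfit(log_C, np.log(N_opt), 1)[0]
L_slope = np.polyfit(log_C, np.log(L_opt), 1)[0]

print(f"t* ~ C^{t_slope:.3f}, N* ~ C^{N_slope:.3f}, L* ~ C^{L_slope:.3f}")
```

With these assumed exponents, the fitted slopes match the closed-form values r_N / (r_t + r_N) = 2/3 for steps, r_t / (r_t + r_N) = 1/3 for parameters, and -r_t * r_N / (r_t + r_N) = -1/3 for the loss. Swapping in other exponents changes the split, but the optimal allocation always favors the resource with the smaller exponent.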
