A Solvable Model of Neural Scaling Laws
Alexander Maloney, Daniel A. Roberts, James Sully
TL;DR
This work proposes a solvable theory of neural scaling laws by coupling a latent-space data-generating process with a nonlinear random-feature map and solving the resulting generalized linear regression problem using planar random-matrix theory. The analysis shows that power-law scaling of the test loss emerges when the effective spectrum of the data representation follows a heavy-tailed power law and when neither resource, data or parameters, bottlenecks the other. Key insights include how nonlinear feature maps extend the spectral bulk, the critical role of the latent-space dimension M, and the optimality of the equiparameterization regime under regularization; the scaling laws break down when M is not the largest scale. The framework also clarifies limitations of the data model, connects to prior RMT approaches, and suggests directions for incorporating representation learning and more complex (quadratic) models to capture broader scaling phenomena.
Abstract
Large language models with a huge number of parameters, when trained on a near-internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down and how they behave when they do. The key findings are how the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss, and how the finite extent of the data's spectral power law causes the model's performance to plateau.
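The link between a power-law spectrum, a power-law loss, and an eventual plateau can be illustrated with a small numerical sketch. This is a heuristic and not the random-matrix solution described above: it assumes a spectrum of the form lam_i = lam_plus * i^-(1 + alpha) for i = 1..M, and it assumes that a model with N features resolves roughly the top N spectral modes, so that the unresolved variance stands in for the test loss. The symbols alpha and lam_plus and all function names are introduced here for illustration only.

```python
import math

def power_law_spectrum(M, alpha, lam_plus=1.0):
    """Assumed spectrum: lam_i = lam_plus * i**-(1 + alpha) for i = 1..M."""
    return [lam_plus * i ** -(1.0 + alpha) for i in range(1, M + 1)]

def unresolved_variance(spectrum, n_modes):
    """Total variance in the modes beyond the top n_modes.

    Heuristic proxy for the test loss (an assumption of this sketch).
    Returns 0.0 once n_modes reaches the end of the spectrum.
    """
    return math.fsum(spectrum[n_modes:])

def loglog_slope(x1, y1, x2, y2):
    """Slope of the straight line through two points on a log-log plot."""
    return (math.log(y2) - math.log(y1)) / (math.log(x2) - math.log(x1))

M, alpha = 100_000, 0.5
spectrum = power_law_spectrum(M, alpha)

# Inside the power-law range the proxy loss falls roughly as N**-alpha.
slope = loglog_slope(
    10, unresolved_variance(spectrum, 10),
    100, unresolved_variance(spectrum, 100),
)
print("fitted exponent:", -slope)

# Beyond the finite extent M of the spectrum there is nothing left to
# resolve, so adding more modes no longer improves the proxy loss.
for n in (1_000, 10_000, M, 2 * M):
    print(n, unresolved_variance(spectrum, n))
```

In this toy picture the fitted exponent is close to alpha while N is much smaller than M, and the proxy loss stops improving once N reaches M, which mirrors the plateau attributed above to the finite extent of the spectral power law. The same heuristic applies with the training set size in place of N when data is the limiting resource.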
