Information-Theoretic Foundations for Neural Scaling Laws
Hong Jun Jeon, Benjamin Van Roy
TL;DR
Jeon and Van Roy develop an information‑theoretic foundation for neural scaling laws, addressing how to allocate compute between model size and data in foundation‑model pretraining. By modeling data with a latent generator $F$ and optimizing a predictive log‑loss, they bound the reducible error of constrained predictors and decompose it into estimation and misspecification components. In an illustrative infinite‑width, two‑layer data‑generating process, they show that the compute‑optimal trade‑off between data and model size is linear up to logarithmic factors, with the compute‑optimal parameter count and dataset size both scaling as $\tilde{\Theta}(\sqrt{C})$. The results establish a principled frontier for resource budgeting in pretraining and point to extensions to richer architectures and to the trade‑off between pretraining and fine‑tuning, where an information‑theoretic lens could unify decisions across training stages.
Abstract
Neural scaling laws aim to characterize how out-of-sample error behaves as a function of model and training dataset size. Such scaling laws guide the allocation of computational resources between model and data processing to minimize error. However, existing theoretical support for neural scaling laws lacks rigor and clarity, entangling the roles of information and optimization. In this work, we develop rigorous information-theoretic foundations for neural scaling laws. This allows us to characterize scaling laws for data generated by a two-layer neural network of infinite width. We observe that the optimal relation between data and model size is linear, up to logarithmic factors, corroborating large-scale empirical investigations. Concise yet general results of the kind we establish may bring clarity to this topic and inform future investigations.
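
As a rough illustration of what a linear data-to-model relation implies for budgeting, the sketch below splits a compute budget between parameters and training samples. It assumes a simple cost model $C = k \cdot p \cdot T$ (compute proportional to parameter count $p$ times dataset size $T$) and a fixed ratio $T = r \cdot p$; the cost model, the constants `k` and `r`, and the function name `allocate` are illustrative assumptions that do not come from the paper, and the logarithmic factors in the paper's result are ignored.

```python
import math


def allocate(compute: float, k: float = 1.0, r: float = 1.0) -> tuple[float, float]:
    """Split a compute budget under the ASSUMED cost model C = k * p * T.

    p is the parameter count and T the dataset size, with T = r * p
    (a linear data-to-model relation). Both k and r are hypothetical
    constants chosen for illustration only.
    """
    # Substituting T = r * p into C = k * p * T gives C = k * r * p**2,
    # so p = sqrt(C / (k * r)): both p and T grow like sqrt(C).
    params = math.sqrt(compute / (k * r))
    samples = r * params
    return params, samples


if __name__ == "__main__":
    for c in (1e18, 1e20, 1e22):
        p, t = allocate(c)
        print(f"C={c:.0e}  params~{p:.2e}  samples~{t:.2e}")
```

Under these assumptions, multiplying the compute budget by 100 multiplies both the parameter count and the dataset size by 10, which is the square-root scaling stated above.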
