Information-Theoretic Foundations for Neural Scaling Laws

Hong Jun Jeon, Benjamin Van Roy

TL;DR

Jeon and Van Roy develop an information‑theoretic foundation for neural scaling laws, addressing how to allocate compute between model size and data in foundation‑model pretraining. By modeling data with a latent generator $F$ and optimizing a predictive log‑loss, they bound the reducible error of constrained predictors and decompose it into estimation and misspecification components. In an illustrative infinite‑width, two‑layer data‑generating process, they show that the compute‑optimal trade‑off between data and model size is linear up to logarithmic factors, with the compute‑optimal parameter count and dataset size both scaling as $\tilde{\Theta}(\sqrt{C})$. The results establish a principled frontier for resource budgeting in pretraining and point to extensions to richer architectures and to the pretraining vs fine‑tuning setting, where an information‑theoretic lens could unify decisions across training stages.

Abstract

Neural scaling laws aim to characterize how out-of-sample error behaves as a function of model and training dataset size. Such scaling laws guide allocation of computational resources between model and data processing to minimize error. However, existing theoretical support for neural scaling laws lacks rigor and clarity, entangling the roles of information and optimization. In this work, we develop rigorous information-theoretic foundations for neural scaling laws. This allows us to characterize scaling laws for data generated by a two-layer neural network of infinite width. We observe that the optimal relation between data and model size is linear, up to logarithmic factors, corroborating large-scale empirical investigations. Concise yet general results of the kind we establish may bring clarity to this topic and inform future investigations.

Paper Structure

This paper contains 18 sections, 14 theorems, 39 equations, and 1 figure.

Key Result

Theorem 3.1

For all $T\in\mathbb{Z}_{++}$ and random variables $F:\Omega\mapsto\mathcal{F}, \tilde{F}:\Omega\mapsto \tilde{\mathcal{F}}$ for which $\tilde{\mathcal{F}}\subseteq\mathcal{F}$, if $((X_t, Y_{t+1}):t\in \mathbb{Z}_{+})$ is iid conditioned on $F$, then where $\hat{P}_t(\cdot) = \mathbb{P}(Y_{t+1}\in\cdot|F\leftarrow\tilde{F}, X_t).$

Figures (1)

  • Figure 1: Above (left), we depict the error bound from Corollary 4.1 for $d=10$, $K=100$, and various FLOP counts $C$. Each curve consists of pairs $(n,T)$ for which $d\cdot n\cdot T = C$. Therefore, each curve depicts the possible error values attainable at a given FLOP count. Noticeably, an improper allocation of compute can lead to higher error despite greater resource investment. (Right) we depict the compute-optimal tradeoff between parameter count and dataset size. The dashed line represents a line of slope $1$. As a result, the relationship between optimal parameter count and dataset size eventually looks linear (as suggested by Theorem 4.2).
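
The sweep described in the Figure 1 caption can be sketched in a few lines: fix a FLOP budget $C$, walk over pairs $(n,T)$ satisfying $d\cdot n\cdot T = C$, and keep the pair with the smallest bound. The page does not reproduce the paper's actual error bound, so the sketch below substitutes a toy surrogate, `toy_bound`, with a misspecification term `a / n` and an estimation term `b / T`. That surrogate, the constants `a` and `b`, and the function names are all assumptions made for illustration. Only $d=10$ and the constraint $d\cdot n\cdot T = C$ are taken from the caption, and $K$ is not modeled at all.

```python
import math


def toy_bound(n, T, a=1.0, b=1.0):
    """Toy surrogate error bound (an assumption, NOT the paper's bound).

    a / n stands in for misspecification error (shrinks with model size n),
    b / T stands in for estimation error (shrinks with dataset size T).
    """
    return a / n + b / T


def compute_optimal(C, d=10, a=1.0, b=1.0, grid=2000):
    """Return the (n, T, bound) minimizing toy_bound subject to d * n * T = C.

    n is swept over a log-spaced grid from 1 to C / d, so T = C / (d * n)
    stays at or above 1 along the whole curve.
    """
    best = None
    for i in range(grid + 1):
        n = (C / d) ** (i / grid)   # log-spaced candidate model size
        T = C / (d * n)             # dataset size implied by the budget
        err = toy_bound(n, T, a, b)
        if best is None or err < best[2]:
            best = (n, T, err)
    return best


if __name__ == "__main__":
    for C in (1e6, 1e8, 1e10):
        n, T, err = compute_optimal(C)
        print(f"C={C:.0e}  n*={n:.1f}  T*={T:.1f}  bound={err:.2e}")
```

Under this surrogate the minimizer has the closed form $n^* = \sqrt{aC/(bd)}$ and $T^* = \sqrt{bC/(ad)}$, so both grow as $\sqrt{C}$ and stay proportional to each other. That mirrors the shape of the slope-$1$ relationship in the right panel, but without the logarithmic factors that appear in the paper's result.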

Theorems & Definitions (24)

  • Theorem 3.1
  • Theorem 4.1
  • Corollary 4.1
  • Theorem 4.2
  • Theorem A.1
  • proof
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • ...and 14 more