4+3 Phases of Compute-Optimal Neural Scaling Laws

Elliot Paquette; Courtney Paquette; Lechao Xiao; Jeffrey Pennington

TL;DR

The paper develops a solvable power-law random features model, with data complexity α, target complexity β, and model size d, to analyze compute-optimal neural scaling in the infinite-data regime. By deriving a Volterra-type equation for the training dynamics via a deterministic equivalent of SGD, it decomposes the loss into forcing and kernel components and identifies a structure of 4 phases (plus 3 subphases) in the α–β plane that governs the scaling exponents. The work derives the compute-optimal model size d*(f) and the corresponding loss scaling P(f/d*,d*) in every phase, including universal regimes with d* ~ f^{1/2}, and characterizes phase boundaries driven by model capacity, feature embedding distortion, and SGD noise. The findings connect random-matrix theory to practical compute-budget planning by giving explicit exponents and asymptotics for choosing model size under a fixed compute budget.

Abstract

We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.
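The training procedure described in the abstract (one-pass SGD on a mean-squared loss for a random features model with power-law structure) can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' code: it assumes Gaussian inputs whose j-th coordinate has standard deviation j^{-α}, target coefficients j^{-β}, a Gaussian random features matrix `W` with entries of variance 1/d, and an update that sums gradients over the batch. The function name `plrf_sgd` and all parameter defaults are hypothetical.

```python
import numpy as np

def plrf_sgd(alpha, beta, v, d, steps, gamma=0.1, batch=1, seed=0):
    """One-pass SGD on a power-law random features model (illustrative sketch).

    Assumed setup (not taken from the summary above):
      x_j ~ N(0, j^(-2*alpha)) independently, target y = <x, b> with b_j = j^(-beta),
      features W^T x with W of shape (v, d) and i.i.d. N(0, 1/d) entries.
    Returns the population risk after each step (length steps + 1).
    """
    rng = np.random.default_rng(seed)
    j = np.arange(1, v + 1, dtype=float)
    scale = j ** (-alpha)                 # per-coordinate std of the input
    b = j ** (-beta)                      # target coefficients
    W = rng.normal(size=(v, d)) / np.sqrt(d)
    theta = np.zeros(d)

    def risk(theta):
        # Population mean-squared error E[(<W^T x, theta> - <x, b>)^2]
        resid = W @ theta - b
        return float(np.sum((scale * resid) ** 2))

    losses = [risk(theta)]
    for _ in range(steps):
        x = rng.normal(size=(batch, v)) * scale   # fresh samples: each is used once
        feats = x @ W                             # shape (batch, d)
        err = feats @ theta - x @ b               # shape (batch,)
        theta -= gamma * feats.T @ err            # gradient summed over the batch
        losses.append(risk(theta))
    return np.array(losses)
```

Calling, for example, `plrf_sgd(1.0, 1.0, v=200, d=50, steps=1000)` returns a loss curve indexed by iteration; repeating the call for several values of `d` gives the kind of loss-versus-compute curves the paper analyzes.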

Paper Structure (83 sections, 45 theorems, 479 equations, 16 figures)

Introduction
Main contributions.
Related work.
Problem Setup: SGD on Power-law Random Features
Algorithmic set-up.
Main goal.
Notation.
Learning Dynamics of SGD
Deterministic equivalent.
Forcing function and kernel function
The 4 Phases
Compute-optimal Curves
Details for each phase.
Phase Ia, Ib, Ic. Capacity constrained.
...and 68 more sections

Key Result

Proposition 2.1

Suppose learning rate $\gamma$ and batch $B$ satisfy $\|\mathscr{K}\| < 1 \text{ and } \gamma (B+1) < 2.$ Then $\mathscr{P}(r)$ is bounded.
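The two conditions in Proposition 2.1 are easy to check before launching a run. The sketch below is a hypothetical helper, not part of the paper: it takes the kernel norm $\|\mathscr{K}\|$ as an input, because the definition of $\mathscr{K}$ is not given in this summary, and the function names are assumptions.

```python
def max_learning_rate(batch: int) -> float:
    """Supremum of learning rates allowed by gamma * (B + 1) < 2 alone."""
    return 2.0 / (batch + 1)

def satisfies_prop_2_1(gamma: float, batch: int, kernel_norm: float) -> bool:
    """Check the sufficient conditions ||K|| < 1 and gamma * (B + 1) < 2.

    kernel_norm must be computed elsewhere; it is treated as given here.
    """
    return kernel_norm < 1.0 and gamma * (batch + 1) < 2.0
```

Both inequalities are strict, so a learning rate exactly equal to `max_learning_rate(batch)` does not satisfy the proposition's hypothesis.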

Figures (16)

Figure 1: Toy scaling problem. We plot the loss function $\mathscr{P}(\theta_r; d)$ as a function of flops ${\mathfrak{f}}$ using Eq. \ref{eq:compute}. Consider a fixed number of flops ${\mathfrak{f}} = 10^7$ (dashed line). If we choose a small model, e.g., $d = 1600$, we can run for a long time, but the model has little capacity, so the loss remains high. On the other hand, we can increase capacity by choosing a large number of parameters (e.g., $d = 51,200$), but because compute is fixed we cannot run the algorithm for very long, so the loss is still large. The optimal choice is $d \approx 6,400$. Repeating this for every choice of ${\mathfrak{f}}$ gives the compute-optimal curve (red line). This choice of $(\alpha, \beta)$ (Phase I) is an example where model capacity controls the compute-optimal curve, but it is not the only behavior we show. In other phases the compute-optimal curve is controlled by poor model embedding (Phases II, III) and SGD noise (Phases III, IV).
Figure 2: Phase Diagram and Cartoon Plots of Loss Curves in Different Phases. (a) Phase Diagram. Colored regions mark where the risk curves during training and the compute-optimal curves look qualitatively and quantitatively different depending on $\alpha$ and $\beta$. This, in turn, yields different scaling law $(\eta)$ and parameter count $(\xi)$ exponents for each of the phases. At the critical point $\alpha = \beta = 1/2$, all behaviors are observed. The other plots illustrate the components of $\mathscr{F}$ (via $\mathscr{F}_0, \mathscr{F}_{pp}, \mathscr{F}_{ac}$) and $\mathscr{K}_{pp}$ which dominate the loss curve for each phase (see Sec. \ref{sec:loss_high_dim_line} for proofs); the tradeoff between the functions where the compute-optimal point occurs is also indicated (see Sec. \ref{sec:intro_forcing_kernel_functions} for definitions and Sec. \ref{sec:compute_optimal_curves_intro} & Sec. \ref{sec:compute_optimal_curve_detail} for proofs).
Figure 3: Compute-Optimal Front at the Phase II-III boundary. (a) The Volterra equations perfectly capture the training dynamics of SGD as the model-parameter count ranges from $d=200$ to $d=12800$. (b) We apply the IsoFLOP approach of [hoffmann2022chinchilla] to our toy model to extract the compute-optimal front: the compute-optimal loss (highlighted in red in (a)) and the compute-optimal model size (scattered in purple in (c)). A power-law fit to the compute-optimal front gives a measured scaling law exponent of 0.648 (vs. the theoretical prediction 0.643 in Table \ref{table:phases_intro}). In (c), we power-law fit the relation between compute and (empirical) optimal model size via Approaches 1 and 2 of [hoffmann2022chinchilla], obtaining $d^{\star} \asymp {\mathfrak{f}}^{0.508}$ and $d^{\star} \asymp {\mathfrak{f}}^{0.525}$, respectively (vs. theory, $d^{\star} \asymp {\mathfrak{f}}^{0.5}$). See Sec. \ref{sec:experimental_results} for details.
Figure 4: (a) Scaling Law Exponents. The heatmap displays scaling law exponents ($\eta$) in the $(\alpha, \beta)$-plane. Hatched lines mark the region with universal scaling behavior, $d^{\star} \asymp {\mathfrak{f}}^{0.5}$, independent of $(\alpha, \beta)$. (b) Exponent Measurements. Empirical exponents (following [hoffmann2022chinchilla]; see Sec. \ref{sec:experimental_results} for details) are compared with theoretical predictions, traversing the phase diagram horizontally at $\alpha=0.7$ from Phases Ia $\rightarrow$ II $\rightarrow$ III as $\beta \uparrow$.
Figure 5: Finite-size effects. (a) The ratio of the exact solution of eq. (\ref{eq:volterra_equation}) to the estimate in eq. (\ref{eq:simplified_loss}) is bounded by constants for all $r$, confirming the validity of eq. (\ref{eq:simplified_loss}); shown here is $(\alpha, \beta) = (0.7,1.2)$. (b) For non-asymptotic $d$, the estimate in eq. (\ref{eq:simplified_loss}) (solid curves) predicts both the magnitudes and trends of the measured exponents of the empirical compute-optimal frontier (points), shown here for $(\alpha, \beta) = (0.7,1.2)$ computed using Approach 0 (see Appendix \ref{sec:experimental_results}) to capture the instantaneous slope; the dashed lines show the asymptotic exponents from Table \ref{table:phases_intro}. (c) The finite-size behavior relaxes to the asymptotic predictions over horizons whose length can grow exceedingly large, especially in the vicinity of the phase transition, shown here for $\beta = 0.7$ approaching the Phase IVa$\to$IVb boundary.
...and 11 more figures
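The procedure described in the captions of Figures 1 and 3 (fix a flop budget, pick the model size with the lowest loss, then fit power laws to the resulting frontier) can be sketched on a stand-in loss. Everything here is an assumption for illustration: `toy_loss` is a hypothetical two-term loss, not the paper's $\mathscr{P}$, and flops are taken to be iterations times $d$ (batch size 1). The fit is an ordinary least-squares line in log-log coordinates, a simpler procedure than the approaches of [hoffmann2022chinchilla].

```python
import numpy as np

def toy_loss(flops, d, a=0.5, b=0.5):
    """Hypothetical loss: an optimization term decaying in iterations r = flops / d
    plus a capacity term decaying in model size d. Not the paper's loss."""
    r = flops / d
    return r ** (-a) + d ** (-b)

def compute_optimal(flops_grid, d_grid, a=0.5, b=0.5):
    """For each flop budget, return the model size on d_grid with the lowest loss,
    together with that loss."""
    f = np.asarray(flops_grid, dtype=float)[:, None]
    d = np.asarray(d_grid, dtype=float)[None, :]
    losses = toy_loss(f, d, a, b)                  # shape (n_flops, n_d)
    best = np.argmin(losses, axis=1)
    return d[0, best], losses[np.arange(len(best)), best]

def fit_power_law(x, y):
    """Least-squares fit of y ~ c * x**slope in log-log space; returns (slope, c)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope, np.exp(intercept)

flops_grid = np.logspace(6, 12, 25)
d_grid = np.logspace(1, 7, 601)
d_star, loss_star = compute_optimal(flops_grid, d_grid)

xi, _ = fit_power_law(flops_grid, d_star)          # d* ~ f**xi
neg_eta, _ = fit_power_law(flops_grid, loss_star)  # loss* ~ f**(-eta)
eta = -neg_eta
```

For this toy loss, setting the derivative in $d$ to zero gives $d^{\star} \propto {\mathfrak{f}}^{a/(a+b)}$, so the fitted `xi` can be compared against a known value; with the paper's loss curves the same fitting step would be applied to simulated or Volterra-equation losses instead.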

Theorems & Definitions (96)

Definition 1.1: Admissible $v$ and $d$
Proposition 2.1: Sufficient conditions on learning rate and batch
Remark 2.1
Theorem 2.1: Approximation solution for $\mathscr{P}$
Remark 2.2
Proposition C.1: Kernel norm
Remark C.1: Convergence threshold conditions.
Proposition C.2: Necessary/Sufficient conditions on learning rate and batch size
proof
Remark C.2
...and 86 more
