Unified Neural Network Scaling Laws and Scale-time Equivalence

Akhilan Boopathy, Ila Fiete

TL;DR

The paper addresses how test error scales when jointly varying model size, data, and training time under fixed compute. It introduces scale-time equivalence, showing that increasing model scale is effectively interchangeable with training longer, so that scale and time enter through the product $pt$, and derives a unified scaling law that blends signal and noise contributions to error. The framework explains phenomena such as reduced data needs for larger models, heightened sensitivity to label noise in overparameterized regimes, and non-monotonic performance with scale, while enabling practical predictions of large-scale behavior from smaller, longer-trained models. Empirically, it validates the theory on vision benchmarks, demonstrates a predictive trade-off between scale and training time, and offers guidance for efficient training budgets and model selection. The approach holds potential for extending to LLMs and other domains, suggesting that smaller models trained longer could rival larger ones under certain conditions, thereby broadening the accessibility of large-scale neural networks.
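
As a rough illustration of the product rule, here is a sketch under stated assumptions: $p$ is read as model size and $t$ as training time, data volume is held fixed, and the function $f$ and constant $c$ are notation introduced here rather than taken from the paper. If test error depends on $p$ and $t$ only through $pt$, then rescaling one and inversely rescaling the other leaves the error unchanged:

$$\mathcal{E}(p, t) \approx f(pt) \quad\Longrightarrow\quad \mathcal{E}(cp,\, t/c) \approx \mathcal{E}(p, t) \quad \text{for any } c > 0.$$

Under this idealization, a model with $p = 10^5$ trained for $100$ epochs would land at the same error as one with $p = 10^6$ trained for $10$ epochs, since both give $pt = 10^7$.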

Abstract

As neural networks continue to grow in size but datasets might not, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Thus, neural network scaling laws, which characterize how test error varies with network size and data volume, have become increasingly important. However, existing scaling laws are often applicable only in limited regimes and often do not incorporate or predict well-known phenomena such as double descent. Here, we present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks. We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally. Scale-time equivalence challenges the current practice, wherein large models are trained for small durations, and suggests that smaller models trained over extended periods could match their efficacy. It also leads to a novel method for predicting the performance of large-scale networks from small-scale networks trained for extended epochs, and vice versa. We next combine scale-time equivalence with a linear model analysis of double descent to obtain a unified theoretical scaling law, which we confirm with experiments across vision benchmarks and network architectures. These laws explain several previously unexplained phenomena: reduced data requirements for generalization in larger models, heightened sensitivity to label noise in overparameterized models, and instances where increasing model scale does not necessarily enhance performance. Our findings hold significant implications for the practical deployment of neural networks, offering a more accessible and efficient path to training and fine-tuning large models.

Paper Structure (34 sections, 1 theorem, 44 equations, 11 figures)


Key Result

Theorem 1

We denote the loss as a function of $\alpha$: $L : \mathbb{R}^r \to \mathbb{R}$. Suppose $L$ has Lipschitz constant $l$ and its second derivative has Lipschitz constant $h$. Suppose that continuous-time gradient flow is applied to $\theta$ with learning rate $\eta$ from initialization $\theta=0$. [...] with initial condition $A_0 = K \beta_0$. Note that $A_t$ does not depend on $p$. Then, with probab

Figures (11)

  • Figure 1: Proportional trade-off between model scale and training time: testing the prediction on a linear model. Red lines indicate tradeoff curves between number of training iterations and model size. Curves are computed by, for each model size, measuring the minimum amount of training time necessary to achieve different loss levels. Different curves indicate different performance thresholds; darker lines indicate a smaller error threshold. Margins indicate standard errors over $5$ trials. Grey dashed lines represent 1:1 proportionality between scale and training iterations.
  • Figure 2: Proportional trade-off between model scale and training time: testing the prediction on neural networks. Red lines indicate tradeoff curves between number of training epochs and network scale for different datasets and architectures trained with SGD. Different curves indicate different amounts of training data; darker lines indicate more data. Curves are computed by, for each network scale, measuring the minimum amount of training time necessary to achieve non-zero generalization. Margins indicate standard errors over $5$ trials. Grey curves are lines of 1:1 proportionality between scale and training epochs.
  • Figure 3: By scale-time equivalence, small models trained for long times predict performance of large models trained for small times and vice versa: test of prediction. Predicted and true test and train error of a CNN (top row) and MLP (bottom row) trained on MNIST. Column 1: predicting the performance of larger models over a few epochs by training smaller models for up to $100$ epochs. Column 2: predicting performance of smaller models over many epochs by training larger models for $1$ epoch. We use scale-time equivalence to predict the equivalent scale or number of epochs for each prediction. Margins indicate standard errors over 5 trials.
  • Figure 4: Predicted variations in double-descent behavior, depending on training noise profile. Schematic loss trajectories (a) and corresponding parameter space trajectories (b) of linear regression under various noise settings (different color curves). Depending on the noise profile, parameters may experience a temporary increase in error resembling an interpolation threshold.
  • Figure 5: Larger models require less data to interpolate: test of prediction. Test and train error of CNN models trained on MNIST under varying levels of data. Different curves indicate different model scales; darker colors indicate larger models. Margins indicate standard errors over 5 trials.
  • ...and 6 more figures
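
The procedure behind Figure 1 (for each model size, measure the minimum training time needed to reach a loss level) can be sketched on a toy linear model. This is a minimal illustration, not the paper's experimental setup: the random projection `K`, the quadratic loss, the all-ones target, the learning rate, and the helper name `steps_to_threshold` are all assumptions made for the example.

```python
import numpy as np

def steps_to_threshold(p, r=3, lr=1e-5, frac=0.05, max_steps=100_000, seed=0):
    """Gradient-descent steps until the loss falls to `frac` of its initial value."""
    rng = np.random.default_rng(seed)
    K = rng.standard_normal((r, p))   # assumed random projection: alpha = K @ theta
    target = np.ones(r)               # hypothetical regression target in alpha-space
    theta = np.zeros(p)               # initialization theta = 0
    initial_loss = 0.5 * np.sum(target ** 2)
    for step in range(1, max_steps + 1):
        residual = K @ theta - target
        theta -= lr * (K.T @ residual)   # gradient of 0.5 * ||K theta - target||^2
        loss = 0.5 * np.sum((K @ theta - target) ** 2)
        if loss <= frac * initial_loss:
            return step
    return None  # threshold not reached within max_steps

# Sweep model size p and record the training time needed for a fixed loss level.
sizes = [200, 400, 800, 1600]
results = {p: steps_to_threshold(p) for p in sizes}
for p, steps in results.items():
    print(f"p={p:5d}  steps={steps:5d}  p*steps={p * steps}")
```

If the equivalence holds in this toy setting, `steps` should fall roughly in proportion to `1/p`, so the `p*steps` column stays approximately constant. Residual variation comes from the finite-size spread of the eigenvalues of `K @ K.T` around `p`.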

Theorems & Definitions (1)

  • Theorem 1