Revisiting Neural Scaling Laws in Language and Vision

Ibrahim Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai

TL;DR

The paper investigates neural scaling laws in language and vision and argues that scaling law estimators should be judged by how well they extrapolate learning curves, not by how well they fit (interpolate) the observed points. It introduces the scaling law estimator M4, a sigmoid-like function class that recovers power-law behavior asymptotically while accommodating deviations from it, and validates it across image classification, neural machine translation, language modeling, and BIG-Bench tasks. The authors demonstrate that parameters chosen for the best interpolation can extrapolate poorly, and show that M4 achieves lower extrapolation RMSE in most tasks, often revealing more favorable scaling exponents for larger architectures. To accelerate research, they also release a 90-task benchmark for systematic evaluation of scaling laws.

Abstract

The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT), and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.

Paper Structure (21 sections, 8 equations, 23 figures, 4 tables)

Figures (23)

  • Figure 1: We introduce an estimator $\mathcal{M}_4$ (see Section \ref{sect:m4}) of scaling parameters that extrapolates more accurately from learning curves and compare it against previous methods denoted $\mathcal{M}_1$, $\mathcal{M}_2$, and $\mathcal{M}_3$ (see Section \ref{sect:related}). TOP: The $y$-axis is ImageNet 10-shot error rate while the $x$-axis is the number of examples in JFT-300M \cite{sun2017revisiting} seen during pre-training. The architecture is BiT/101x3 \cite{kolesnikov2020big} (see Section \ref{sect:experiments} for further details). Values in amber are not seen when fitting the scaling law. BOTTOM: Comparison across four domains. We report the fraction of time ($y$-axis, higher is better) in which a method achieves the best extrapolation error in the given domain's tasks (see Section \ref{sect:experiments}). Because several methods may perform equally well in one task, average rankings do not always sum to one.
  • Figure 2: The excess risk $\varepsilon_x-\varepsilon_\infty^\star$ is plotted against the training data size for logistic regression where instances $\textbf{x}\in\mathbb{R}^d$ are drawn uniformly at random from the surface of the unit sphere and the label is binary $\textbf{y}\in\{-1, +1\}$ with noise rate $\delta$ (see Section \ref{sect:m4}). In this experiment, $d=100$ and $\delta=0.2$. In each figure, only the data sizes that exceed the indicated cutoff value are used to estimate the scaling law parameters. $\mathcal{M}_2$ is accurate only when the data resides entirely in the power law regime (rightmost figure), whereas $\mathcal{M}_4$ works well in all cases.
  • Figure 3: In this experiment, ViT/B/16 \cite{dosovitskiy2020image} is pretrained on JFT-300M \cite{sun2017revisiting}, where the evaluation metric is 10-shot ImageNet error rate. LEFT & MIDDLE: the learning curve is plotted. Different scaling exponents using the function class $\mathcal{M}_2$ can fit the learning curve almost equally well. Values in green correspond to the data used to train the scaling law estimator (up to 500M seen examples) while values in yellow are used to evaluate the extrapolation loss. RIGHT: Best fitting parameters do not necessarily coincide with the scaling parameters that achieve small extrapolation loss.
  • Figure 4: In this experiment, we pretrain each vision architecture on subsets of JFT-300M \cite{sun2017revisiting} and report the 10-shot ImageNet accuracy \cite{deng2009imagenet}. When the architecture is pretrained on a subset of size 12M, we observe overfitting, where the performance begins to drop if the model is pretrained for a large number of steps. Nevertheless, prior to reaching peak performance, training examples behave as if they are fresh samples, which is consistent with the earlier observations reported in \cite{nakkiran2020deep}.
  • Figure 5: ImageNet 10-shot accuracy ($y$-axis) vs. the number of bootstrapped examples seen during upstream training in ViT/B/16 and MiX/L/16 (see Figure \ref{fig:main} for BiT/101x3 and Appendix \ref{app:figs} for all remaining figures). The curves in each column correspond to the scaling law learned using the corresponding function class. The values marked in amber are reserved for evaluating how well the scaling law parameters extrapolate. Generally, $\mathcal{M}_4$ extrapolates better than previous methods.
  • ...and 18 more figures
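
The split described in the Figure 3 caption (fit on the early part of a learning curve, then score on the held-out larger data sizes) can be illustrated with a minimal sketch. It is not the authors' recipe. It assumes a generic three-parameter power law $\varepsilon(x) = \beta x^{-c} + \varepsilon_\infty$ as a stand-in, since this summary does not spell out the function classes $\mathcal{M}_1$ to $\mathcal{M}_4$. The learning curve is synthetic, and the helper names (`predict`, `rmse`, `fit_power_law`, `demo`) and the grid-search fitting procedure are all hypothetical choices made for illustration.

```python
import math


def predict(params, x):
    """Assumed stand-in law: beta * x**(-c) + eps_inf (not the paper's M4)."""
    beta, c, eps_inf = params
    return beta * x ** (-c) + eps_inf


def rmse(params, xs, ys):
    """Root mean squared error of the fitted law on the points (xs, ys)."""
    sq = [(predict(params, x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(sq) / len(sq))


def fit_power_law(xs, ys, n_grid=200):
    """Grid search over eps_inf; for each candidate, fit (beta, c) by
    ordinary least squares on log(y - eps_inf) versus log(x)."""
    y_min = min(ys)
    lx = [math.log(x) for x in xs]
    mx = sum(lx) / len(lx)
    sxx = sum((a - mx) ** 2 for a in lx)
    best_err, best_params = None, None
    for i in range(n_grid):
        eps_inf = y_min * i / n_grid  # candidates stay strictly below y_min
        ly = [math.log(y - eps_inf) for y in ys]
        my = sum(ly) / len(ly)
        slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sxx
        params = (math.exp(my - slope * mx), -slope, eps_inf)
        err = rmse(params, xs, ys)
        if best_err is None or err < best_err:
            best_err, best_params = err, params
    return best_params


def demo():
    """Fit on small data sizes only, then compare interpolation and
    extrapolation error on a synthetic curve that bends at small x."""
    xs = [10.0 ** (k / 2) for k in range(4, 17)]  # 1e2 up to 1e8
    ys = [0.1 + 2.0 * (x + 1000.0) ** -0.5 for x in xs]  # synthetic data
    seen = [(x, y) for x, y in zip(xs, ys) if x <= 1e5]
    held_out = [(x, y) for x, y in zip(xs, ys) if x > 1e5]
    params = fit_power_law([x for x, _ in seen], [y for _, y in seen])
    interp = rmse(params, [x for x, _ in seen], [y for _, y in seen])
    extrap = rmse(params, [x for x, _ in held_out], [y for _, y in held_out])
    return interp, extrap


if __name__ == "__main__":
    interp, extrap = demo()
    print(f"interpolation RMSE: {interp:.5f}")
    print(f"extrapolation RMSE: {extrap:.5f}")
```

Reporting the two errors separately is the point of the sketch: parameters are selected using only the seen portion of the curve, and the held-out portion plays no role in fitting, so the second number measures extrapolation rather than goodness of fit.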