(Mis)Fitting: A Survey of Scaling Laws

Margaret Li, Sneha Kudugunta, Luke Zettlemoyer

TL;DR

This work critically assesses the reliability of scaling laws in large-scale foundation models by examining how methodological choices shape inferred laws relating model size $N$, data budget $D$, and loss $L$. It distinguishes performance-prediction and ratio-optimization forms, surveys over 50 scaling-law papers, and documents pervasive under-reporting of essential experimental details. Through replication on Chinchilla data, data from Porian et al. (2024), and newly trained models, it shows that choices such as data definitions, embedding FLOPs, checkpoint usage, loss functions, and initialization can drastically alter conclusions about the optimal allocation of compute between model size and data. The authors provide a practical reproducibility checklist and argue for more thorough reporting to enable meaningful cross-study comparisons and reliable extrapolations. Overall, the paper highlights the fragility of current scaling-law inferences and offers concrete guidelines to improve robustness and interpretability in scaling analyses.

Abstract

Modern foundation models rely heavily on using scaling laws to guide crucial training decisions. Researchers often extrapolate the optimal architecture and hyperparameter settings from smaller training runs by describing the relationship between loss, or task performance, and scale. All components of this process vary, from the specific equation being fit, to the training setup, to the optimization method. Each of these factors may affect the fitted law, and therefore, the conclusions of a given study. We discuss discrepancies in the conclusions that several prior works reach, on questions such as the optimal token-to-parameter ratio. We augment this discussion with our own analysis of the critical impact that changes in specific details may have on a scaling study, and the resulting altered conclusions. Additionally, we survey over 50 papers that study scaling trends: while 45 of these papers quantify these trends using a power law, most under-report crucial details needed to reproduce their findings. To mitigate this, we propose a checklist for authors to consider while contributing to scaling law research.
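
To make the "optimal token-to-parameter ratio" question concrete, here is a minimal Python sketch of how such a ratio can be read off a fitted law. It assumes the Chinchilla-style parametric form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ from Hoffmann et al. (2022) and the common approximation $C \approx 6ND$ for training FLOPs; both are assumptions here, not necessarily the choices made in any given study. The function names and the coefficient values are hypothetical placeholders chosen for illustration, not fitted results.

```python
def loss(n, d, E, A, B, alpha, beta):
    """Assumed parametric form: L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n**alpha + B / d**beta


def compute_optimal(c, A, B, alpha, beta, flops_per_param_token=6.0):
    """Closed-form minimiser of the loss above subject to C = k * N * D.

    Setting the derivative to zero along the constraint gives
    N_opt = G * (C/k)^a and D_opt = (C/k)^b / G, with
    a = beta / (alpha + beta), b = alpha / (alpha + beta),
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta)).
    """
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    base = c / flops_per_param_token
    n_opt = g * base**a
    d_opt = base**b / g
    return n_opt, d_opt


# Hypothetical coefficients, for illustration only (not fitted values).
PARAMS = dict(A=400.0, B=400.0, alpha=0.34, beta=0.28)

if __name__ == "__main__":
    for budget in (1e19, 1e21, 1e23):
        n_opt, d_opt = compute_optimal(budget, **PARAMS)
        print(f"C={budget:.0e}  N_opt={n_opt:.3e}  D_opt={d_opt:.3e}  "
              f"tokens/param={d_opt / n_opt:.1f}")
```

Because the exponents $a$ and $b$ depend only on $\alpha$ and $\beta$, small shifts in the fitted exponents change how the tokens-per-parameter ratio moves with compute, which is one way the methodological choices discussed in the abstract can lead to different conclusions.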

Paper Structure

This paper contains 50 sections, 3 equations, 21 figures, 10 tables.

Figures (21)

  • Figure 1: We provide a summary of the papers surveyed, highlighting the reproducibility challenges endemic to scaling law papers.
  • Figure 1: We introduce a checklist for researchers to use for scaling laws research. In Appendix \ref{sec:app_checklist}, we include an expanded version of the checklist that may be used as a template.
  • Figure 2: (§\ref{sec:own-repl}) We study the effects of various decisions in the fitting of a power law, as outlined in our checklist (Appendix \ref{sec:app_checklist}) and detailed in §\ref{sec:power-law-form}-§\ref{sec:opt}. For each set of analyses, we include the scaling laws found by Kaplan et al. (2020) and Hoffmann et al. (2022) for comparison. We also include markers indicating 3 existing models for comparison purposes: Llama 3 405B (Dubey et al., 2024), the Chinchilla model (Hoffmann et al., 2022), and an estimate of the 1.5B GPT-2 model (Radford et al., 2019), for which we know details of the dataset storage size and word count, but not an exact count of data BPE tokens, which we estimate at 100B. We additionally annotate, at the compute budget $C$ for each of these 3 reference points, the maximum and minimum predicted (i.e., extrapolated) optimal model parameter count $N_{opt}$ and data budget $D_{opt}$ from the fitted power laws. In each plot, we use a thicker, solid line for the method that achieves the lowest optimization loss, with the exception of the plots comparing power law forms, loss functions, and optimizers, for which this would be nonsensical. We find overall, throughout our analyses, that all of the decisions we explore have an impact on the final fit of the power law, supporting our conclusion that more thorough reporting of these decisions is critical for scaling law reproducibility.
  • Figure 3: (§\ref{sec:own-repl}) We replicate the plots in Figure \ref{fig:overall}, reorganized so that analyses for datasets with mid-training checkpoints appear alongside those for data with final checkpoints only. This side-by-side comparison makes the difference in power law fits apparent, further underscoring the impact of including mid-training datapoints. We study the effects of various decisions in the fitting of a power law, as outlined in our checklist (Appendix \ref{sec:app_checklist}) and detailed in §\ref{sec:power-law-form}-§\ref{sec:opt}. For each set of analyses, we include the scaling laws found by Kaplan et al. (2020) and Hoffmann et al. (2022) for comparison. We also include markers indicating 3 existing models for comparison purposes: Llama 3 405B (Dubey et al., 2024), the Chinchilla model (Hoffmann et al., 2022), and an estimate of the 1.5B GPT-2 model (Radford et al., 2019), for which we know details of the dataset storage size and word count, but not an exact count of data BPE tokens, which we estimate at 100B. We additionally annotate, at the compute budget $C$ for each of these 3 reference points, the maximum and minimum predicted (i.e., extrapolated) optimal model parameter count $N_{opt}$ and data budget $D_{opt}$ from the fitted power laws. In each plot, we use a thicker, solid line for the method that achieves the lowest optimization loss, with the exception of the plots comparing power law forms, loss functions, and optimizers, for which this would be nonsensical. We find overall, throughout our analyses, that all of the decisions we explore have an impact on the final fit of the power law, supporting our conclusion that more thorough reporting of these decisions is critical for scaling law reproducibility.
  • ...and 16 more figures