Relative Scaling Laws for LLMs

William Held, David Hall, Percy Liang, Diyi Yang

TL;DR

The paper introduces relative scaling laws to quantify how performance gaps between test distributions evolve with scale, addressing the limitation that traditional scaling laws average over heterogeneous subpopulations. It models error on each distribution as a power law in compute, $E(F) = \alpha F^{-\beta}$, and defines the relative law as the ratio of errors on two distributions, $G(F) = E_{t}(F)/E_{b}(F) = \gamma F^{\Delta\beta}$, which predicts whether a gap narrows or widens as compute increases. The authors build a large-scale, compute-controlled suite of 255 decoder-only Transformer models trained under IsoFLOP budgets from $10^{18}$ to $10^{20}$ FLOPs across three corpora, and apply the framework to three case studies: knowledge domains (MMLU), language variation (ICE), and AI risk clusters (Anthropic evaluations). They find that scaling narrows disparities in some domains while amplifying others, such as regional language differences and certain capability- and influence-related risks, indicating that scale is not a universal equalizer. The public release of the full model suite supports reproducible, cross-domain analysis and targeted robustness and fairness investigations.
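As a rough illustration of the two formulas above, the following sketch fits each power law by linear regression in log-log space. It is not the authors' fitting procedure: the function names, the synthetic error values, and the use of ordinary least squares are all assumptions made for the example. Dividing the two power laws gives $\gamma = \alpha_t/\alpha_b$ and $\Delta\beta = \beta_b - \beta_t$, so $\Delta\beta$ is the slope of $\log G$ against $\log F$.

```python
import numpy as np


def fit_power_law(flops, errors):
    """Fit E(F) = alpha * F**(-beta) by least squares on log E vs. log F."""
    slope, intercept = np.polyfit(np.log(flops), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)  # (alpha, beta)


def fit_relative_law(flops, err_t, err_b):
    """Fit G(F) = E_t(F) / E_b(F) = gamma * F**delta_beta in log-log space."""
    ratio = np.asarray(err_t, dtype=float) / np.asarray(err_b, dtype=float)
    slope, intercept = np.polyfit(np.log(flops), np.log(ratio), 1)
    return float(np.exp(intercept)), float(slope)  # (gamma, delta_beta)


# Synthetic, noise-free errors (hypothetical values, not results from the paper).
flops = np.logspace(18, 20, 5)      # compute budgets in FLOPs
err_b = 20.0 * flops ** -0.05       # distribution b: alpha = 20, beta = 0.05
err_t = 60.0 * flops ** -0.07       # distribution t: alpha = 60, beta = 0.07

alpha_b, beta_b = fit_power_law(flops, err_b)
alpha_t, beta_t = fit_power_law(flops, err_t)
gamma, delta_beta = fit_relative_law(flops, err_t, err_b)

print(f"beta_b={beta_b:.4f}  beta_t={beta_t:.4f}")
print(f"gamma={gamma:.4f}  delta_beta={delta_beta:.4f}")
```

In this synthetic setup the error on distribution t starts above the error on b and falls faster, so the fitted $\Delta\beta$ is negative and the ratio $G(F)$ shrinks toward 1 as compute grows. A positive $\Delta\beta$ would instead mean the ratio grows with scale.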

Abstract

Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$--$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.

Paper Structure

This paper contains 22 sections, 4 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Relative scaling law case studies. Scaling compute has uneven effects (illustrated here with models trained on DCLM from $10^{18}$--$10^{20}$ FLOPs): (left) knowledge domains, (center) English variation, and (right) AI risk behaviours. We propose relative scaling laws as a method to measure which gaps close with scale and which persist or widen.
  • Figure 2: Compute-optimal scaling and downstream forecasting. Left: For each FLOP budget, we sweep token count and model size to select the compute-optimal token count. Middle: Along these compute-optimal points, we estimate how task or subgroup loss scales as a function of compute. Right: We show that this loss tracks accuracy closely along a sigmoidal curve, allowing loss to serve as a proxy for downstream progress while measuring effects at reduced scale.
  • Figure 3: Prompt formatting drives scaling smoothness. Left: Degree of variance explained by scale under different prompts. Right: Accuracy differences between prompt variants and MCQ.
  • Figure 4: Relative scaling laws across domains in MMLU. Columns show results for CommonPile, DCLM Baseline, and Nemotron. (a) Traditional scaling laws for bits per byte (BPB) scaling across topic groups. (b) Relative scaling laws, normalized so that each curve is expressed relative to the STEM scaling trend. Curves for STEM, humanities, social sciences, and miscellaneous domains converge toward 0 as compute increases, indicating that domain disparities shrink with scale.
  • Figure 5: Relative scaling of written Global Englishes. Columns show results for CommonPile, DCLM Baseline, and Nemotron. (a) Traditional scaling laws for bits per byte (bpb) vs. compute. (b) Relative scaling laws as bpb differences from U.S. English (dashed line). (c) Correlation between relative scaling slopes and English-speaking internet users at the time the International Corpus of English was collected. Regions with larger online English-speaking populations scale faster.
  • ...and 6 more figures