Relative Scaling Laws for LLMs

William Held; David Hall; Percy Liang; Diyi Yang

TL;DR

The paper introduces relative scaling laws, which quantify how performance gaps between test distributions evolve with scale; traditional scaling laws average over heterogeneous subpopulations and therefore hide such gaps. Error on each distribution is modeled as a power law in compute, $E(F) = \alpha F^{-\beta}$, and the relative law is the ratio of two such curves, $G(F) = E_{t}(F)/E_{b}(F) = \gamma F^{\Delta\beta}$, whose exponent $\Delta\beta$ predicts whether a gap narrows or widens as compute increases. The authors build a large-scale, compute-controlled suite of 255 decoder-only Transformer models trained under IsoFLOP budgets from $10^{18}$ to $10^{20}$ FLOPs across three corpora, and apply the framework to three case studies: knowledge domains (MMLU), language variation (ICE), and AI risk clusters (Anthropic evaluations). They find that scaling narrows disparities in some domains while widening others, such as regional language differences and certain capability- and influence-related risks, so scale is not a universal equalizer. The public release of the full model suite supports reproducible, cross-domain analysis and enables targeted robustness and fairness investigations informed by relative scaling.
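To make the two formulas concrete: if each distribution follows its own power law, $E_{t}(F) = \alpha_{t} F^{-\beta_{t}}$ and $E_{b}(F) = \alpha_{b} F^{-\beta_{b}}$, then dividing them gives $\gamma = \alpha_{t}/\alpha_{b}$ and $\Delta\beta = \beta_{b} - \beta_{t}$. The sketch below fits $\gamma$ and $\Delta\beta$ by linear regression in log-log space. It is an illustration only, not the authors' fitting procedure: the function name `fit_relative_law` and the synthetic error values are assumptions made for this example.

```python
import numpy as np

def fit_relative_law(flops, err_target, err_base):
    """Fit G(F) = gamma * F**delta_beta to the error ratio E_t(F) / E_b(F).

    Hypothetical helper for illustration: an ordinary least-squares line
    in log-log space, which may differ from the paper's fitting method.
    """
    flops = np.asarray(flops, dtype=float)
    ratio = np.asarray(err_target, dtype=float) / np.asarray(err_base, dtype=float)
    # log G = log gamma + delta_beta * log F
    slope, intercept = np.polyfit(np.log(flops), np.log(ratio), deg=1)
    return float(np.exp(intercept)), float(slope)

# Synthetic (made-up) errors that follow exact power laws.
flops = np.logspace(18, 20, 5)            # compute budgets in FLOPs
err_base = 4.0 * flops ** -0.05           # alpha_b = 4,  beta_b = 0.05
err_target = 20.0 * flops ** -0.08        # alpha_t = 20, beta_t = 0.08

gamma, delta_beta = fit_relative_law(flops, err_target, err_base)
# Expect gamma close to 20 / 4 = 5 and delta_beta close to 0.05 - 0.08 = -0.03.
print(f"gamma={gamma:.3f}, delta_beta={delta_beta:.3f}")
```

In this synthetic case the ratio stays above 1 and $\Delta\beta$ is negative, so the target distribution's error shrinks toward the baseline's as compute grows; a positive $\Delta\beta$ would mean the ratio grows with compute.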

Abstract

Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$ to $10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work so that practitioners can measure relative scaling laws alongside traditional ones and better prioritize robustness challenges in light of the bitter lesson.
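As a rough illustration of what a matched-compute (IsoFLOP) budget implies, the sketch below lists how many training tokens each model size can afford at a fixed FLOP budget. It relies on the common approximation that training compute is about $6ND$ FLOPs for $N$ parameters and $D$ tokens, which is an assumption here and is not stated in the text above. The helper `isoflop_tokens` and the model sizes are likewise hypothetical and are not taken from the released suite.

```python
def isoflop_tokens(flop_budget, param_counts):
    """Return (params, tokens) pairs that all spend roughly flop_budget FLOPs.

    Assumes training FLOPs ~= 6 * params * tokens, a widely used rule of
    thumb; the actual accounting used for the model suite may differ.
    """
    return [(n, flop_budget / (6.0 * n)) for n in param_counts]

# Hypothetical model sizes, chosen only for illustration.
sizes = [50_000_000, 200_000_000, 800_000_000]

for budget in (1e18, 1e19, 1e20):
    for n_params, n_tokens in isoflop_tokens(budget, sizes):
        print(f"budget={budget:.0e}  params={n_params:.1e}  tokens={n_tokens:.2e}")
```

Within one budget, larger models see proportionally fewer tokens, which is what lets a sweep over model sizes isolate the effect of compute.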
