Random Scaling for Emergent Capabilities
Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra
TL;DR
This work argues that emergent capabilities are not unlocked at a fixed scale threshold; instead, they arise from continuous changes in multimodal performance distributions across random seeds as models scale. It introduces a distributional scaling framework with metrics $B$ and $L$ to quantify how seed-level clusters shape observed trends, and demonstrates the framework on synthetic length-generalization tasks and on LM experiments evaluated with MMLU. The results show that apparently abrupt improvements can coincide with gradual shifts in the underlying distribution, and that multimodality persists even under continuous loss metrics, so random seed variation must be accounted for when predicting performance from scale. By linking emergence to seed-driven distribution dynamics, the paper reconciles the "emergence" and "mirage" narratives and has practical implications for how scaling behavior is evaluated and reported across seeds.
Abstract
Language models famously improve under a smooth scaling law, but some specific capabilities exhibit sudden breakthroughs in performance. While advocates of "emergence" view breakthroughs as unlocked capabilities, others attribute them to thresholding effects on noncontinuous metrics. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes when performance is bimodally distributed across random seeds. In synthetic length generalization tasks, we show that different random seeds can produce either highly linear or emergent scaling trends. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. In a case study of inverse scaling, we show that even as the probability of a successful run declines, the average performance of a successful run increases monotonically. We validate our distributional scaling framework on realistic settings by measuring MMLU performance in LM populations. Our observations hold true even under continuous loss metrics, confirming that random variation must be considered when predicting a model's performance from its scale.
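The central claim, that a sharp breakthrough can come from a smooth change in a bimodal distribution over seeds, can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup. It assumes a hypothetical logistic success probability in log-scale and two hypothetical Gaussian performance modes (a "failed" mode near 0.1 and a "successful" mode near 0.9). All function names and parameter values are invented for illustration.

```python
import math
import random
import statistics


def success_probability(log_scale, midpoint=5.0, slope=1.0):
    """Assumed smooth (logistic) chance that a seed lands in the high mode."""
    return 1.0 / (1.0 + math.exp(-slope * (log_scale - midpoint)))


def sample_seed_scores(log_scale, n_seeds, rng):
    """Draw one score per seed from a two-mode mixture (illustrative values)."""
    p = success_probability(log_scale)
    scores = []
    for _ in range(n_seeds):
        # Each seed either "succeeds" (high mode) or "fails" (low mode).
        mode_mean = 0.9 if rng.random() < p else 0.1
        score = rng.gauss(mode_mean, 0.03)
        scores.append(min(1.0, max(0.0, score)))  # clip to [0, 1]
    return scores


def summarize(log_scales, n_seeds=200, seed=0):
    """Return (log_scale, p_success, mean, median) for each scale."""
    rng = random.Random(seed)
    rows = []
    for s in log_scales:
        scores = sample_seed_scores(s, n_seeds, rng)
        rows.append(
            (s, success_probability(s),
             statistics.mean(scores), statistics.median(scores))
        )
    return rows


if __name__ == "__main__":
    for s, p, mean, median in summarize([3, 4, 5, 6, 7]):
        print(f"log_scale={s}  p_success={p:.2f}  "
              f"mean={mean:.2f}  median={median:.2f}")
```

In this toy setting the success probability and the mean across seeds change gradually with scale, while the median (or any single seed) sits in one mode or the other and therefore appears to jump. A single training run per scale would report only one draw from this mixture, which is why the trend it shows can look either linear or emergent.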
