Random Scaling for Emergent Capabilities

Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra

TL;DR

This work argues that emergent capabilities do not switch on at a fixed scale threshold; they arise from continuous changes in multimodal performance distributions across random seeds as models scale. It introduces a distributional scaling framework with metrics $B$ (breakthroughness) and $L$ (linearity) to quantify how seed-level clusters shape observed trends, and demonstrates this on synthetic length-generalization tasks and LM experiments on MMLU. The results show that abrupt improvements in a single run can coincide with gradual shifts in the distribution, and that multimodality persists under continuous loss metrics, so random seed variation must be accounted for when predicting performance from scale. By linking emergence to seed-driven distribution dynamics, the paper reconciles the emergence and mirage narratives and has practical implications for how scaling behavior is evaluated and reported across seeds.

Abstract

Language models famously improve under a smooth scaling law, but some specific capabilities exhibit sudden breakthroughs in performance. While advocates of "emergence" view breakthroughs as unlocked capabilities, others attribute them to thresholding effects on noncontinuous metrics. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes when performance is bimodally distributed across random seeds. In synthetic length generalization tasks, we show that different random seeds can produce either highly linear or emergent scaling trends. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. In a case study of inverse scaling, we show that even as the probability of a successful run declines, the average performance of a successful run increases monotonically. We validate our distributional scaling framework on realistic settings by measuring MMLU performance in LM populations. Our observations hold true even under continuous loss metrics, confirming that random variation must be considered when predicting a model's performance from its scale.
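The central claim, that a sharp jump in a summary statistic can come from a smooth change in a bimodal distribution, can be illustrated with a toy simulation. This is a minimal sketch and not the paper's experimental setup: the cluster centers (0.05 and 0.85), their spread, the success-probability grid, and the function names `simulate_population` and `histogram_mode` are all assumptions made up for illustration.

```python
import numpy as np


def simulate_population(p_success, n_seeds, rng):
    """Draw per-seed accuracies from a two-cluster mixture (toy assumption)."""
    success = rng.random(n_seeds) < p_success
    # Hypothetical clusters: failed runs near 0.05 accuracy, successful runs near 0.85.
    acc = np.where(
        success,
        rng.normal(0.85, 0.03, n_seeds),
        rng.normal(0.05, 0.03, n_seeds),
    )
    return np.clip(acc, 0.0, 1.0)


def histogram_mode(acc, bins=10):
    """Return the center of the most populated histogram bin on [0, 1]."""
    counts, edges = np.histogram(acc, bins=bins, range=(0.0, 1.0))
    i = int(np.argmax(counts))
    return 0.5 * (edges[i] + edges[i + 1])


rng = np.random.default_rng(0)
# Stand-in for "scale": the probability of landing in the success cluster
# rises smoothly from 0.1 to 0.9.
p_grid = np.linspace(0.1, 0.9, 9)

means, modes = [], []
for p in p_grid:
    acc = simulate_population(p, n_seeds=2000, rng=rng)
    means.append(acc.mean())
    modes.append(histogram_mode(acc))

means = np.array(means)
modes = np.array(modes)
print("largest step in mean:", np.abs(np.diff(means)).max())
print("largest step in mode:", np.abs(np.diff(modes)).max())
```

In this toy setting the mean moves in small, even steps while the mode switches from one cluster to the other in a single step. A single seed sampled at each scale can show either pattern, depending on which cluster it lands in.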

Paper Structure

This paper contains 21 sections, 3 equations, 18 figures.

Figures (18)

  • Figure 1: Different random seeds produce different scaling trends. Scaling trends can be emergent or linear for different seeds, even if all models train on the same data with the same hyperparameters. On the count task, we show trends for random seeds with the highest breakthroughness (seed 93; top left) and linearity (seed 205; top right). We mark parameter counts immediately before and after seed 93's emergence respectively as (a) and (b). Histograms illustrate the bimodal distribution of performance across all random seeds at scales (a) and (b), marking the positions of seeds 93 and 205. Breakthroughs occur when consecutive points represent different clusters; linear trends occur when each point is sampled from the same gradually shifting cluster.
  • Figure 2: Random variation in length generalization (addition task). Histograms of exact match accuracy on length $40$ sequences when independently scaling (a) width and (b) depth.
  • Figure 3: Summary statistics for length generalization (addition task). Exact match statistics for 200 models trained at length 35 and tested at length 40. We track overall (a) mode and (b) mean. Because the EM accuracy distribution is bimodal, the mode exhibits a sharp increase even as the mean evolves continuously. Defining success as >20% EM accuracy, we also note the continuous change in (c) the probability of success and in (d) the mean of successful runs. Means feature 95% confidence intervals over 1000 bootstrapped samples.
  • Figure 4: Random variation in continuous length generalization error (addition task). Kernel Density Estimation (KDE) of our loss-based error metric (the paper's continuous error equation) across model runs. At scales where the EM accuracy distribution is bimodal, the distribution remains bimodal even when using a continuous metric.
  • Figure 5: Changes in random variation. Wasserstein-L2 distance of each scale's performance distribution relative to the largest scale, scaling depth and width independently. We mark the emergence of bimodality at the last scale before multiple peaks appear. We mark the mode breakthrough at the last scale before successful length generalization becomes marginally more likely than failure.
  • ...and 13 more figures
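
The per-scale statistics described in the captions for Figures 3 and 5 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the percentile bootstrap, the sorted-sample form of the Wasserstein-2 distance, the function names, and the example data are all hypothetical. Only the 20% success threshold, the 1000 bootstrap samples, and the 95% interval come from the captions.

```python
import numpy as np

SUCCESS_THRESHOLD = 0.20  # success defined as >20% EM accuracy (Figure 3 caption)


def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, rng=None):
    """Mean with a percentile-bootstrap confidence interval (assumed method)."""
    rng = np.random.default_rng(0) if rng is None else rng
    values = np.asarray(values, dtype=float)
    # Resample seeds with replacement, n_boot times.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(values.mean()), float(lo), float(hi)


def summarize(em_acc, rng=None):
    """Mean, probability of success, and mean of successful runs at one scale."""
    em_acc = np.asarray(em_acc, dtype=float)
    is_success = em_acc > SUCCESS_THRESHOLD
    successes = em_acc[is_success]
    return {
        "mean": bootstrap_mean_ci(em_acc, rng=rng),
        "frac_success": float(is_success.mean()),
        # Undefined when no run succeeds at this scale.
        "mean_success": bootstrap_mean_ci(successes, rng=rng) if len(successes) else None,
    }


def wasserstein2_1d(x, y):
    """Empirical Wasserstein-2 distance between equal-size 1D samples.

    In one dimension the optimal coupling matches sorted samples, so the
    distance is the root mean squared difference of the order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.sqrt(np.mean((x - y) ** 2)))


# Made-up population of 200 runs: 120 failures at 0.0 and 80 successes at 0.9.
em_acc = np.array([0.0] * 120 + [0.9] * 80)
stats = summarize(em_acc, rng=np.random.default_rng(0))
shifted_distance = wasserstein2_1d(em_acc, em_acc + 1.0)
print(stats["frac_success"], stats["mean"], shifted_distance)
```

Reporting the probability of success and the mean of successful runs separately keeps the two sources of change apart: a population can improve because more seeds succeed, because successful seeds get better, or both.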