How Do Large Language Monkeys Get Their Power (Laws)?

Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo

TL;DR

Per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law - even as each problem scales exponentially on its own.

Abstract

Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law - even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ${\sim}2-4$ orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and the development of scaling-predictable evaluations of (multimodal) language models.
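The tension between per-problem exponential scaling and aggregate power law scaling can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes, purely for illustration, that single-attempt success probabilities follow a Beta($\alpha$, $\beta$) distribution, and all function names and parameter values are hypothetical. It uses the closed form $\mathbb{E}[(1-p)^k] = B(\alpha, \beta + k) / B(\alpha, \beta)$ for the aggregate failure rate.

```python
import math

def per_problem_failure(p: float, k: int) -> float:
    """Failure rate on one problem after k independent attempts: (1 - p)^k."""
    return (1.0 - p) ** k

def aggregate_failure_beta(alpha: float, beta: float, k: int) -> float:
    """E[(1 - p)^k] for p ~ Beta(alpha, beta), i.e. B(alpha, beta + k) / B(alpha, beta).

    The Beta form is an illustrative assumption, not taken from the text above.
    """
    log_ratio = (math.lgamma(beta + k) - math.lgamma(alpha + beta + k)
                 + math.lgamma(alpha + beta) - math.lgamma(beta))
    return math.exp(log_ratio)

def neg_log_aggregate_pass(alpha: float, beta: float, k: int) -> float:
    """-log(pass_D@k), where pass_D@k = 1 - E[(1 - p)^k]."""
    return -math.log1p(-aggregate_failure_beta(alpha, beta, k))

def local_slope(f, k: int) -> float:
    """Slope of log f(k) against log k, measured between k and 2k."""
    return (math.log(f(2 * k)) - math.log(f(k))) / math.log(2.0)

# Per problem: the log failure rate is linear in k (exponential decay).
p = 0.05
for k in (10, 20, 40):
    print(k, math.log(per_problem_failure(p, k)) / k)  # constant: log(1 - p)

# Aggregate: the log-log slope of -log(pass_D@k) settles near -alpha (a power law).
alpha, beta = 0.3, 3.0
for k in (10, 100, 1_000, 10_000):
    print(k, local_slope(lambda kk: neg_log_aggregate_pass(alpha, beta, kk), k))
```

In this sketch the density near zero behaves like $p^{\alpha - 1}$, so the many problems with very small success probability dominate the average at large $k$, and the aggregate exponent approaches $\alpha$ even though every individual problem decays exponentially.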

Paper Structure

This paper contains 41 sections, 4 theorems, 186 equations, 9 figures, 2 tables.

Key Result

Theorem 3.1

Let $\mathcal{D}$ be a probability distribution on $[0,1]$ with PDF $p_{\mathcal{D}}(\operatorname{pass_i@1})$. Suppose there exist constants $b > 0$, $C > 0$, $\theta > 0$ and $\delta > 0$ such that, for all $0 < \operatorname{pass_i@1} < \delta$, we have Then, for large $k$,
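The following heuristic calculation sketches why a power-law left tail yields aggregate power law scaling. It is an illustration under a stated assumption, not the theorem's exact statement: write $p = \operatorname{pass_i@1}$ and assume the density near zero behaves like $p_{\mathcal{D}}(p) \approx C \, p^{b-1}$ for $0 < p < \delta$. For large $k$, only $p \lesssim 1/k$ contributes appreciably, so $(1-p)^k \approx e^{-kp}$ and the upper integration limit can be extended to infinity.

```latex
\begin{align*}
1 - \operatorname{pass_{\mathcal{D}}@k}
  &= \mathbb{E}_{p \sim \mathcal{D}}\big[(1-p)^k\big]
   = \int_0^1 (1-p)^k \, p_{\mathcal{D}}(p) \, dp \\
  &\approx C \int_0^\infty e^{-kp} \, p^{b-1} \, dp
   = C \, \Gamma(b) \, k^{-b}, \\
-\log(\operatorname{pass_{\mathcal{D}}@k})
  &= -\log\big(1 - \mathbb{E}[(1-p)^k]\big)
   \approx C \, \Gamma(b) \, k^{-b}.
\end{align*}
```

Under this assumption the tail exponent $b$ of the single-attempt success distribution becomes the exponent of the aggregate power law in $k$.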

Figures (9)

  • Figure 1: Power Law Scaling in Language Models from Repeat Sampling. Top: Brown et al. (2024) found the negative log average pass rate $-\log(\operatorname{pass_{\mathcal{D}}@k})$ at solving mathematical problems scales polynomially (i.e., as a power law) with the number of independent attempts per problem $k$. Bottom: Hughes et al. (2024) similarly found the negative log average attack success rate $-\log(\operatorname{ASR_{\mathcal{D}}@k})$ when jailbreaking multimodal language models scales polynomially with the number of jailbreak attempts per prompt. Should such power law scaling be expected? From where do large language monkeys obtain their power (laws)?
  • Figure 2: Schematic: The Origin of Power Laws from Scaling Inference Compute via Repeat Sampling. The $- \log (\operatorname{pass_{\mathcal{D}}@k})$ scales as a power law with the number of attempts per problem $k$ (left). This arises from a combination of two factors: (1) for each problem, $-\log(\operatorname{pass_i@k})$ scales exponentially with $k$ (center), and (2) the distribution (over problems in the dataset) of single-attempt success rates $\operatorname{pass_i@1}$ itself has a left power-law tail of small values (right).
  • Figure 3: Per-problem performance scales exponentially with the number of attempts per problem $k$. Top: Pythia language models on 128 problems from MATH, with performance on the $i$-th problem measured as $-\log(\operatorname{pass_i@k})$. Bottom: Frontier AI models on jailbreaking prompts from HarmBench, with performance on the $i$-th problem measured as $-\log(\operatorname{ASR_i@k})$. In both settings, on each problem, the negative log per-problem success rate falls exponentially with the number of independent attempts $k$. However, the negative log average success rate falls as a power law with $k$ (black).
  • Figure 4: Single-Attempt Success Rates Distributions Possess Power Law-Like Left Tails. Pythia language models on 128 MATH problems (top) and frontier AI systems on 159 HarmBench prompts (bottom) exhibit distributions (over problems) of $\operatorname{pass_i@1}$ and $\operatorname{ASR_i@1}$ with power law-like tails that are well fit by scaled Beta-Binomial distributions (black dashed lines), which produce aggregate power law scaling. Note that Llama 3 8B Instruction Tuned (IT) does not possess a power law tail, explaining why the model did not exhibit aggregate power law scaling under Best-of-N jailbreaking (Sec. \ref{sec:no_dist_structure_no_power_law}).
  • Figure 5: Schematic: Two Estimators of Power Law Parameters for Scaling Inference Compute via Repeat Sampling. (A) Both estimators begin by generating many samples per prompt, then computing the number of successes per prompt. In the standard least squares power law parameter estimator (top), (B) $\operatorname{pass_i@k}$ is estimated for each $i$-th problem at multiple $k$ values, then (C) averaged over problems and fit with linear regression in log-log space. In the distributional power law parameter estimator (bottom), (D) a distribution $\mathcal{D}$ is fit to estimates of $\operatorname{pass_i@1}$, then (E) the single-attempt success probability distribution is used to simulate $\operatorname{pass_{\mathcal{D}}@k}$ at arbitrary $k$ values for linear regression in log-log space.
  • ...and 4 more figures
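
The two estimators outlined in the Figure 5 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the per-problem estimate uses the standard unbiased pass@k formula $1 - \binom{n-c}{k} / \binom{n}{k}$, and the distributional step swaps in a plain Beta distribution fit by the method of moments in place of the scaled Beta-Binomial fit the caption of Figure 4 mentions. All function names, sample sizes and $k$ grids are hypothetical.

```python
import math
import numpy as np

def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass_i@k from c successes in n samples (needs k <= n)."""
    if n - c < k:
        return 1.0
    # Equals 1 - C(n - c, k) / C(n, k), written as a numerically stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def fit_power_law(ks, ys):
    """Fit ys ~ a * ks**(-b) by linear regression in log-log space; return (a, b)."""
    slope, intercept = np.polyfit(np.log(ks), np.log(ys), 1)
    return float(np.exp(intercept)), float(-slope)

def least_squares_estimator(successes, n, ks):
    """Steps (B)-(C): estimate pass_i@k per problem, average, regress.

    Assumes the averaged pass rate is strictly between 0 and 1 at every k.
    """
    ys = [-math.log(np.mean([pass_at_k_unbiased(n, int(c), k) for c in successes]))
          for k in ks]
    return fit_power_law(np.asarray(ks, dtype=float), np.asarray(ys))

def beta_method_of_moments(p_hat):
    """Fit Beta(alpha, beta) to estimated pass_i@1 values (an assumed simplification)."""
    m, v = float(np.mean(p_hat)), float(np.var(p_hat))
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

def distributional_estimator(successes, n, ks):
    """Steps (D)-(E): fit a distribution to pass_i@1, then compute pass_D@k at any k."""
    alpha, beta = beta_method_of_moments(np.asarray(successes) / n)
    ys = []
    for k in ks:
        # log E[(1 - p)^k] for p ~ Beta(alpha, beta)
        log_fail = (math.lgamma(beta + k) - math.lgamma(alpha + beta + k)
                    + math.lgamma(alpha + beta) - math.lgamma(beta))
        ys.append(-math.log1p(-math.exp(log_fail)))
    return fit_power_law(np.asarray(ks, dtype=float), np.asarray(ys))

# Synthetic data (hypothetical): 2000 problems, 200 samples per problem.
rng = np.random.default_rng(0)
n = 200
successes = rng.binomial(n, rng.beta(0.3, 3.0, size=2000))
print(least_squares_estimator(successes, n, [1, 2, 4, 8, 16, 32, 64, 128]))
print(distributional_estimator(successes, n, [1_000, 2_000, 4_000, 8_000, 16_000]))
```

The design difference is visible in the `ks` arguments: the least squares route can only use $k \le n$ because it needs per-problem estimates at each $k$, whereas the distributional route evaluates the fitted distribution at values of $k$ far beyond the number of samples actually drawn.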

Theorems & Definitions (6)

  • Theorem 3.1: Sufficiency of Power-Law Left Tail in Distribution of Single-Attempt Success Rates
  • Theorem 3.2: Necessity of Power-Law Left Tail in Distribution of Single-Attempt Success Rates
  • Theorem 5.1
  • proof
  • Theorem 5.2
  • proof