Predicting Emergent Abilities with Infinite Resolution Evaluation

Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, Maosong Sun

TL;DR

This work introduces PassUntil, an evaluation strategy with theoretically infinite resolution, achieved through extensive sampling at decoding time, to quantify task-level scaling laws in large language models. By deriving a task scaling law from loss scaling and validating it on two model series of up to 2.4B parameters, the authors demonstrate highly accurate predictions of task performance (e.g., code generation) and reveal an accelerated emergence regime for certain tasks. They formalize per-instance (IPU) fitting to capture variability across test items and classify emergence using the function $F(N)=\log(-\log PU(N))$, identifying convex, linear, and concave growth patterns. The study also discusses two hypothesized mechanisms behind accelerated emergence (multi-step reasoning versus multiple circuits) and provides open-source evaluation tools to foster reproducibility and future research on the scalable, predictable deployment of AI systems.
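The shape classification mentioned above can be illustrated with a minimal sketch. It assumes that $F$ is examined as a function of $\log N$ and that the shape is read off the sign of the average second divided difference; the function names, the tolerance, and the choice of $\log N$ as the horizontal axis are illustrative assumptions, not the authors' procedure.

```python
import math

def f_transform(pu):
    """Return F = log(-log(PU)) for a pass rate strictly between 0 and 1."""
    if not 0.0 < pu < 1.0:
        raise ValueError("pass rate must lie strictly between 0 and 1")
    return math.log(-math.log(pu))

def classify_shape(sizes, pass_rates, tol=1e-6):
    """Label F(log N) as 'convex', 'linear', or 'concave'.

    Assumption: sizes are strictly increasing and the label is taken from
    the sign of the mean second divided difference (hypothetical criterion).
    """
    if len(sizes) != len(pass_rates) or len(sizes) < 3:
        raise ValueError("need at least three (size, pass rate) pairs")
    xs = [math.log(n) for n in sizes]
    fs = [f_transform(p) for p in pass_rates]
    curvatures = []
    for i in range(len(xs) - 2):
        slope_left = (fs[i + 1] - fs[i]) / (xs[i + 1] - xs[i])
        slope_right = (fs[i + 2] - fs[i + 1]) / (xs[i + 2] - xs[i + 1])
        curvatures.append((slope_right - slope_left) / (xs[i + 2] - xs[i]))
    mean_curvature = sum(curvatures) / len(curvatures)
    if abs(mean_curvature) <= tol:
        return "linear"
    return "convex" if mean_curvature > 0 else "concave"
```

A pass rate of the form $\exp(-cN^{-\alpha})$ gives $F = \log c - \alpha \log N$, so it is labeled linear under this criterion; which of the other two shapes corresponds to accelerated emergence is not stated in this summary and is left open here.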

Abstract

The scientific scale-up of large language models (LLMs) necessitates a comprehensive understanding of their scaling properties. However, the existing literature on scaling properties yields only an incomplete answer: optimization loss decreases predictably as model size increases, in line with established scaling laws, yet no scaling law for task performance has been established, and task performance is far from predictable during scaling. Task performance typically shows minor gains on small models and then improves dramatically once models exceed a size threshold, exemplifying "emergent abilities". In this study, we discover that small models, although they exhibit minor performance, demonstrate critical and consistent task performance improvements that are not captured by conventional evaluation strategies due to insufficient measurement resolution. To measure such improvements, we introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase. With PassUntil, we conduct a quantitative investigation into the scaling law of task performance. The investigation contains two parts. First, we identify a strict task scaling law that was not conventionally known to exist, which enhances the predictability of task performance. Remarkably, we are able to predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts, which is the first systematic attempt to verify the predictable scaling proposed in GPT-4's report. Second, we are able to study emergent abilities quantitatively. We identify a kind of accelerated emergence whose scaling curve cannot be fitted by the standard scaling law function and whose speed is increasing. We then examine two hypotheses and suggest that the "multiple circuits hypothesis" might be responsible for the accelerated emergence.

Paper Structure

This paper contains 36 sections, 5 theorems, 10 equations, 15 figures, 7 tables.

Key Result

Theorem 1

$\textsc{PU}$ is a maximum likelihood estimate for $P(s)$.
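To make the theorem concrete, the following minimal sketch assumes that PassUntil keeps sampling until $r$ samples pass and reports $r/K$, where $K$ is the total number of samples drawn; that reading, the function names, and the sampling budget are assumptions for illustration and are not stated in this summary. Under that assumption $K$ follows a negative binomial distribution, whose log-likelihood $r\log p + (K-r)\log(1-p)$ (up to a constant) is maximized at $p = r/K$.

```python
import math
import random

def pass_until(sample_passes, r=1, max_samples=1_000_000):
    """Draw samples until r of them pass; return (r / K, K).

    sample_passes is a hypothetical callable that generates one sample
    and returns True if it passes the task's check.
    """
    passes = 0
    for k in range(1, max_samples + 1):
        if sample_passes():
            passes += 1
            if passes == r:
                return r / k, k
    raise RuntimeError("sampling budget exhausted before r passes were seen")

def neg_binomial_loglik(p, r, k):
    """Log-likelihood (up to a constant) of needing k draws for r passes."""
    return r * math.log(p) + (k - r) * math.log(1.0 - p)

# Simulated model with an assumed true pass probability of 0.02.
rng = random.Random(0)
estimate, draws = pass_until(lambda: rng.random() < 0.02, r=10)
```

Because sampling stops only after a pass is observed, the estimate is never exactly zero, which is what lets very small pass rates be resolved given a large enough budget.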

Figures (15)

  • Figure 1: We can discriminate subtle performance improvement (left), which conventional methods evaluate as all zeros (right). The right figure directly uses Figure 9(a) in sorscher2022beyond as a comparison, which the authors utilize to illustrate a "break-through" behavior in task performance. The inset in the left figure shows the performances in a $\log(-\log(\cdot))$ space, which displays strong linearity, supporting the task scaling law (Eq. (\ref{eq:task_scaling_raw})).
  • Figure 1: Prediction of our framework compared to the real performance on two series of models. The number after the task denotes the model series used in the evaluation.
  • Figure 2: BS denotes beam search, RS-$K$ denotes random sampling $K$ times.
  • Figure 3: Training loss of the two series of models trained on different data mixtures. The internal figure illustrates the end-step reducible loss relative to model size, represented in logarithmic scale.
  • Figure 4: Task performance scales predictably with model scale. The red points denote the real performance of 2.4B model, which are close to the task scaling laws fitted from 0.03B to 1.5B.
  • ...and 10 more figures
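The extrapolation described in the captions above (fit on smaller models, compare against the 2.4B model) can be sketched as an ordinary least-squares line in $\log(-\log(\cdot))$ space. The sketch assumes the horizontal axis is $\log N$, which the captions do not state explicitly; the function names are hypothetical, and any numbers passed in would be the reader's own measurements, not results from the paper.

```python
import math

def fit_task_scaling(sizes, pass_rates):
    """Least-squares fit of log(-log(PU)) = intercept + slope * log(N)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(-math.log(p)) for p in pass_rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_pass_rate(n, slope, intercept):
    """Invert the transform to get the predicted pass rate at model size n."""
    return math.exp(-math.exp(intercept + slope * math.log(n)))
```

Fitting in the transformed space keeps the problem linear, so no iterative optimizer is needed; the cost is that errors on very small pass rates are weighted differently than they would be in a direct fit.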

Theorems & Definitions (11)

  • Theorem 1
  • proof
  • Definition 1
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • ...and 1 more