Table of Contents
Fetching ...

Predicting Emergent Capabilities by Finetuning

Charlie Snell, Eric Wallace, Dan Klein, Sergey Levine

TL;DR

This work first poses the task of emergence prediction: given access to current LLMs that have random few-shot accuracy on a task, can the authors predict whether future models (GPT-N+1) will have non-trivial accuracy on that task?

Abstract

A fundamental open challenge in modern LLM scaling is the lack of understanding around emergent capabilities. In particular, language model pretraining loss is known to be highly predictable as a function of compute. However, downstream capabilities are far less predictable -- sometimes even exhibiting emergent jumps -- which makes it challenging to anticipate the capabilities of future models. In this work, we first pose the task of emergence prediction: given access to current LLMs that have random few-shot accuracy on a task, can we predict whether future models (GPT-N+1) will have non-trivial accuracy on that task? We then discover a simple insight for this problem: finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable models. To operationalize this insight, we can finetune LLMs with varying amounts of data and fit a parametric function that predicts when emergence will occur (i.e., "emergence laws"). We validate this approach using four standard NLP benchmarks where large-scale open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and CoLA). Using only small-scale LLMs, we find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged. Finally, we present a case study of two realistic uses for emergence prediction.

Predicting Emergent Capabilities by Finetuning

TL;DR

This work first poses the task of emergence prediction: given access to current LLMs that have random few-shot accuracy on a task, can the authors predict whether future models (GPT-N+1) will have non-trivial accuracy on that task?

Abstract

A fundamental open challenge in modern LLM scaling is the lack of understanding around emergent capabilities. In particular, language model pretraining loss is known to be highly predictable as a function of compute. However, downstream capabilities are far less predictable -- sometimes even exhibiting emergent jumps -- which makes it challenging to anticipate the capabilities of future models. In this work, we first pose the task of emergence prediction: given access to current LLMs that have random few-shot accuracy on a task, can we predict whether future models (GPT-N+1) will have non-trivial accuracy on that task? We then discover a simple insight for this problem: finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable models. To operationalize this insight, we can finetune LLMs with varying amounts of data and fit a parametric function that predicts when emergence will occur (i.e., "emergence laws"). We validate this approach using four standard NLP benchmarks where large-scale open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and CoLA). Using only small-scale LLMs, we find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged. Finally, we present a case study of two realistic uses for emergence prediction.

Paper Structure

This paper contains 69 sections, 4 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: We find that task-specific finetuning systematically shifts the point of emergence towards less capable models. Motivated by this finding, we develop an emergence law, which models how the point of emergence shifts as a function of the amount of finetuning data. Using this emergence law, we can then extrapolate a prediction for the point of emergence in the few-shot setting.
  • Figure 2: Left: we predict emergence in the few-shot setting by leveraging information about how "pre-emergence" models behave after finetuning. Our key finding is that finetuning effectively shifts the point of emergence from stronger to weaker models. Moreover, by varying the amount of finetuning data, the emergence point is shifted accordingly. We can use this finding to predict when few-shot emergence will occur by fitting a parametric function to the results (i.e., emergence law) and then taking a limit. Right: using this approach, we can predict emergence up to 4x the FLOPs in advance.
  • Figure 3: Left: the finetuned and few-shot performance of intermediate LLM checkpoints. We plot downstream accuracy against pretraining loss for all 3B, 7B, and 13B intermediate OpenLLaMA V1 checkpoints on MMLU and GSM8K. We see that the point of emergence is systematically shifted towards weaker LLMs after finetuning. Additionally, the magnitude of the shift is consistent across all model sizes at the same pretraining loss. Right: varying the amount of finetuning data. We finetune the 3B intermediate checkpoints on subsets of the full finetuning data. We see that as we increase the amount of finetuning data, the point of emergence shifts further towards less capable LLMs.
  • Figure 4: Our MLE emergence law predictions on each task. The grey line is our extrapolated prediction and the multi-color lines are the fit. While our focus is on predicting the specific point of emergence (e.g., the ReLU elbow), we plot the full ReLU for visual clarity. We see that across all tasks, we are able to successfully predict the point of emergence within 0.1 nats and in many cases much less than that. For visual clarity, we plot a subset of the data used for fitting (see Appendix \ref{['app:full_plots']} for all).
  • Figure 5: The cumulative distribution function (CDF) of our emergence posterior on GSM8K and MMLU (see Appendix \ref{['app:full_plots']} for all tasks). The CDF's slope peaks at the mode of the distribution. We see that the distribution spikes near the true emergence point, followed by a moderately long tail.
  • ...and 14 more figures