Forecasting Rare Language Model Behaviors

Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma

TL;DR

This work addresses the challenge of predicting deployment-scale risks from small evaluation sets by introducing elicitation probabilities and extreme-value-based forecasting. The authors develop a Gumbel-tail method that links tail elicitation quantiles to deployment size via a power-law relationship, enabling forecasts of worst-case, behavior-frequency, and aggregate risks across up to three orders of magnitude of scale. They demonstrate the approach on misuse scenarios (dangerous chemicals and biology) and misalignment challenges, showing high forecast accuracy and practical benefits for red-teaming and risk mitigation. The method offers a principled, scalable framework for pre-deployment risk assessment and real-time monitoring, with potential extensions to uncertainty quantification and adaptive tail testing.

Abstract

Standard language model evaluations can fail to capture risks that emerge only at deployment scale. For example, a model may produce safe responses during a small-scale beta test, yet reveal dangerous information when processing billions of requests at deployment. To remedy this, we introduce a method to forecast potential risks across orders of magnitude more queries than we test during evaluation. We make forecasts by studying each query's elicitation probability -- the probability the query produces a target behavior -- and demonstrate that the largest observed elicitation probabilities predictably scale with the number of queries. We find that our forecasts can predict the emergence of diverse undesirable behaviors -- such as assisting users with dangerous chemical synthesis or taking power-seeking actions -- across up to three orders of magnitude of query volume. Our work enables model developers to proactively anticipate and patch rare failures before they manifest during large-scale deployments.
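The scaling idea in the abstract can be illustrated with a minimal sketch. It assumes (this is an assumption, not something the abstract specifies) that each elicitation probability p is mapped to a score -log(-log p), that the log survival probability of the largest scores is roughly linear in the score, and that this line can be extrapolated to the 1/n quantile for a deployment of n queries. The function name, the parameter `k`, and the least-squares fit are hypothetical choices for illustration; the paper's actual estimator may differ.

```python
import numpy as np


def forecast_worst_query_probability(elicit_probs, n_deploy, k=20):
    """Extrapolate the largest elicitation probability expected among n_deploy queries.

    Hypothetical sketch: fits a line to the log survival probability of the
    k largest scores and extrapolates it to survival probability 1 / n_deploy.
    """
    # Clip so that both logarithms below stay finite.
    p = np.clip(np.asarray(elicit_probs, dtype=float), 1e-300, 1 - 1e-12)
    m = p.size
    if m < k:
        raise ValueError("need at least k evaluation queries")

    # Assumed score transform: larger score means larger elicitation probability.
    scores = -np.log(-np.log(p))

    # The j-th largest score sits at an empirical survival probability of about j / m.
    top_scores = np.sort(scores)[::-1][:k]
    log_survival = np.log(np.arange(1, k + 1) / m)

    # Least-squares line: log_survival ~ slope * score + intercept (slope < 0).
    slope, intercept = np.polyfit(top_scores, log_survival, deg=1)

    # Score at which the fitted survival probability equals 1 / n_deploy.
    score_n = (np.log(1.0 / n_deploy) - intercept) / slope

    # Invert the score transform to get back a probability.
    return float(np.exp(-np.exp(-score_n)))


# Synthetic stand-in for measured elicitation probabilities (not real data).
rng = np.random.default_rng(0)
synthetic_scores = rng.exponential(size=2000) - 6.0
synthetic_probs = np.exp(-np.exp(-synthetic_scores))

for n in (2_000, 20_000, 200_000):
    print(n, forecast_worst_query_probability(synthetic_probs, n))
```

Because the fitted slope is negative, the forecast grows with `n_deploy`, which mirrors the claim that the largest elicitation probabilities increase predictably with query volume. In practice each elicitation probability would itself be estimated, for example by repeated sampling from the query, and that estimation noise is ignored here.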

Paper Structure

This paper contains 36 sections, 6 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Scaling laws enable forecasting rare language model failures. We find that the risk of the highest-risk queries follows a power law in the number of queries. This lets us forecast whether any query is likely to exhibit an undesired behavior at deployment (shaded, right), from orders-of-magnitude smaller evaluations (unshaded, left).
  • Figure 2: Repeatedly sampling from queries can elicit undesired behaviors with low-but-nonzero probability. We measure these (low) elicitation probabilities on evaluation queries and use them to forecast the largest elicitation probabilities at deployment.
  • Figure 3: Comparison of forecasting methods when predicting worst-query risk (left), behavior frequency (middle), and aggregate risk (right) for specific harmful outputs. The Gumbel-tail method consistently makes high-quality forecasts.
  • Figure 4: Example forecast of aggregate risk as a function of the number of queries. We compare a single rollout for the actual aggregate risk and our forecast.
  • Figure 5: Empirical quantiles for the distribution of elicitation probabilities computed by repeated sampling. Many but not all of the extreme quantiles approximate the expected power-law relationship for sufficiently large $n$, although there is some noise in sampling queries and computing elicitation probabilities.
  • ...and 7 more figures