Estimating the Probabilities of Rare Outputs in Language Models

Gabriel Wu, Jacob Hilton

TL;DR

This work formalizes the problem of estimating the probabilities of extremely rare language model outputs under specified input distributions, and presents four estimation techniques: two importance-sampling approaches (ITGIS and MHIS) and two activation-extrapolation methods (QLD and GLD). Through experiments on small transformer models with eight independent-token distributions, the authors show that importance sampling generally outperforms activation extrapolation, with MHIS offering the strongest performance on larger models. The study links low-probability estimation to adversarial training and red-teaming, arguing that accurate probability estimates can guide mitigation of catastrophic outputs and motivate new methods for worst-case guarantees. Overall, the paper demonstrates meaningful gains over naive sampling and highlights directions for extending these approaches to more realistic input distributions and multi-token behaviors.

Abstract

We consider the problem of low probability estimation: given a machine learning model and a formally-specified input distribution, how can we estimate the probability of a binary property of the model's output, even when that probability is too small to estimate by random sampling? This problem is motivated by the need to improve worst-case performance, which distribution shift can make much more likely. We study low probability estimation in the context of argmax sampling from small transformer language models. We compare two types of methods: importance sampling, which involves searching for inputs giving rise to the rare output, and activation extrapolation, which involves extrapolating a probability distribution fit to the model's logits. We find that importance sampling outperforms activation extrapolation, but both outperform naive sampling. Finally, we explain how minimizing the probability estimate of an undesirable behavior generalizes adversarial training, and argue that new methods for low probability estimation are needed to provide stronger guarantees about worst-case performance.
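To illustrate why random sampling fails on rare events and how importance sampling helps, here is a minimal sketch on a toy problem. It is not the authors' ITGIS or MHIS method: the "model" is replaced by a hypothetical rule (`is_rare`), the input distribution is uniform and independent per token, and the proposal distribution is chosen by hand rather than found by search. All names and constants (`VOCAB`, `LENGTH`, `TARGET`, `boost`) are assumptions made for this example.

```python
import random

VOCAB = 10    # hypothetical vocabulary size
LENGTH = 6    # hypothetical number of input tokens
TARGET = 0    # toy rare event: every input token equals TARGET

# Under the uniform independent-token distribution the event has
# probability (1 / VOCAB) ** LENGTH = 1e-6.
TRUE_PROB = (1.0 / VOCAB) ** LENGTH


def is_rare(tokens):
    """Stand-in for 'the model's output has the binary property'."""
    return all(t == TARGET for t in tokens)


def naive_estimate(n, rng):
    """Sample inputs from the original distribution and count hits."""
    hits = 0
    for _ in range(n):
        tokens = [rng.randrange(VOCAB) for _ in range(LENGTH)]
        hits += is_rare(tokens)
    return hits / n


def importance_estimate(n, rng, boost=0.9):
    """Sample from a proposal that favors TARGET, then reweight by p/q."""
    others = [t for t in range(VOCAB) if t != TARGET]
    p_tok = 1.0 / VOCAB                     # original per-token probability
    q_other = (1.0 - boost) / len(others)   # proposal mass on each other token
    total = 0.0
    for _ in range(n):
        weight = 1.0
        tokens = []
        for _ in range(LENGTH):
            if rng.random() < boost:
                tokens.append(TARGET)
                weight *= p_tok / boost
            else:
                tokens.append(rng.choice(others))
                weight *= p_tok / q_other
        if is_rare(tokens):
            total += weight                 # indicator times importance weight
    return total / n


if __name__ == "__main__":
    print("true      :", TRUE_PROB)
    print("naive     :", naive_estimate(10_000, random.Random(0)))
    print("importance:", importance_estimate(10_000, random.Random(0)))
```

With 10,000 samples the naive estimator will almost always return exactly zero, because the event is expected about once per million samples. The reweighted estimator is unbiased for the same quantity but sees the event in roughly half of its samples, so its estimate lands close to the true value.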

Paper Structure

This paper contains 35 sections, 18 equations, 9 figures, 5 tables, and 4 algorithms.

Figures (9)

  • Figure 1: Left: To evaluate the performance of our low probability estimation methods, we compare their estimates against ground-truth probabilities obtained by brute-force sampling with a larger computational budget. Right: The estimates of Metropolis--Hastings Importance Sampling on the icl input distribution and 4-layer model, after a fit has been applied. Each point represents a different target token.
  • Figure 2: The Itakura--Saito loss of all methods across different model sizes. The solid lines indicate the loss of each method averaged over all $8$ distributions, with bands showing standard error. The colored points indicate the loss on individual distributions, with horizontal jitter added for visibility. Lower is better.
  • Figure 3: Examples of method outputs on two different behaviors and models, before a fit is applied. Estimates of $0$ are placed at the bottom of each graph for visibility.
  • Figure 4: The Itakura--Saito loss of all methods across different distributions, averaged over all $3$ model sizes. Lower is better.
  • Figure 5: The ground truth probabilities of tokens for each distribution and model size, sorted from most to least probable (the height of the curve at position $x$ is the probability of the $x$-th most common token). Any tokens that appeared $0$ times across all $2^{32}$ samples are not plotted. The hex distribution only had $159$ and $135$ tokens in the range $[10^{-9}, 10^{-5}]$ for the $2$- and $4$-layer models, respectively, so we used every such token instead of sampling $256$ of them.
  • ...and 4 more figures
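
Figures 2 and 4 report the Itakura--Saito loss without defining it here. As a rough guide, the standard Itakura--Saito divergence between two positive numbers $p$ and $q$ is $p/q - \ln(p/q) - 1$. The sketch below assumes that $p$ is the ground-truth probability and $q$ is the method's estimate; that argument order, and the function name, are assumptions for illustration rather than something confirmed by this summary.

```python
import math


def itakura_saito(p, q):
    """Standard Itakura-Saito divergence between positive scalars p and q.

    Assumed convention for this sketch: p is the ground-truth probability
    and q is the estimate. Both must be strictly positive.
    """
    if p <= 0 or q <= 0:
        raise ValueError("both arguments must be strictly positive")
    ratio = p / q
    return ratio - math.log(ratio) - 1.0


if __name__ == "__main__":
    print(itakura_saito(1e-7, 1e-7))  # exact estimate
    print(itakura_saito(1e-6, 1e-7))  # estimate 10x too small
    print(itakura_saito(1e-7, 1e-6))  # estimate 10x too large
```

The divergence depends only on the ratio $p/q$, so it treats an error by a factor of ten the same way at $10^{-9}$ as at $10^{-5}$. It is zero when the estimate is exact, is not symmetric in its arguments, and is undefined when the estimate is $0$, which is consistent with the figures distinguishing results before and after a fit is applied.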