Estimating the Probabilities of Rare Outputs in Language Models

Gabriel Wu; Jacob Hilton

TL;DR

This work formalizes the problem of estimating the probabilities of extremely rare language-model outputs under a formally specified input distribution and presents four estimation techniques: two importance-sampling approaches (ITGIS and MHIS) and two activation-extrapolation methods (QLD and GLD). In experiments on small transformer models with eight independent-token input distributions, the authors show that importance sampling generally outperforms activation extrapolation, with MHIS offering the strongest performance on larger models. The study links low probability estimation to adversarial training and red-teaming, arguing that accurate probability estimates can guide mitigation of catastrophic outputs and motivate new methods with stronger worst-case guarantees. Overall, the paper demonstrates meaningful gains over naive sampling and highlights directions for extending these approaches to more realistic input distributions and multi-token behaviors.

Abstract

We consider the problem of low probability estimation: given a machine learning model and a formally-specified input distribution, how can we estimate the probability of a binary property of the model's output, even when that probability is too small to estimate by random sampling? This problem is motivated by the need to improve worst-case performance, which distribution shift can make much more likely. We study low probability estimation in the context of argmax sampling from small transformer language models. We compare two types of methods: importance sampling, which involves searching for inputs giving rise to the rare output, and activation extrapolation, which involves extrapolating a probability distribution fit to the model's logits. We find that importance sampling outperforms activation extrapolation, but both outperform naive sampling. Finally, we explain how minimizing the probability estimate of an undesirable behavior generalizes adversarial training, and argue that new methods for low probability estimation are needed to provide stronger guarantees about worst-case performance.
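
To make the contrast between naive sampling and importance sampling concrete, here is a minimal sketch in Python. It is not the authors' ITGIS or MHIS method. The "model" is a hypothetical stand-in (`behavior`), the vocabulary, sequence length, and distributions `P` and `Q` are illustrative assumptions, and the proposal `Q` is picked by hand rather than found by search. The only idea it shares with the abstract is the general one: sample from a distribution under which the rare behavior is common, then reweight each hit by the ratio of input probability to proposal probability.

```python
import random

# All constants below are illustrative assumptions, not values from the paper.
VOCAB = 4      # toy vocabulary size
SEQ_LEN = 4    # tokens per input
RARE = 0       # index of the rare token

# Input distribution: tokens are independent; token 0 has probability 0.01.
P = [0.01, 0.33, 0.33, 0.33]
# Proposal distribution: same support, but token 0 is heavily upweighted.
Q = [0.50, 0.50 / 3, 0.50 / 3, 0.50 / 3]

# For this toy behavior the exact probability is known in closed form.
TRUE_PROB = P[RARE] ** SEQ_LEN


def behavior(tokens):
    """Hypothetical stand-in for 'the model produces the rare output'."""
    return all(t == RARE for t in tokens)


def sample(rng, dist):
    """Draw SEQ_LEN independent tokens from the per-token distribution."""
    return rng.choices(range(VOCAB), weights=dist, k=SEQ_LEN)


def naive_estimate(n, seed=0):
    """Fraction of inputs drawn from P that trigger the behavior."""
    rng = random.Random(seed)
    hits = sum(behavior(sample(rng, P)) for _ in range(n))
    return hits / n


def importance_estimate(n, seed=0):
    """Draw inputs from Q and reweight each hit by P(x) / Q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample(rng, Q)
        if behavior(x):
            weight = 1.0
            for t in x:
                weight *= P[t] / Q[t]
            total += weight
    return total / n


if __name__ == "__main__":
    n = 50_000
    print("true probability:   ", TRUE_PROB)
    print("naive estimate:     ", naive_estimate(n))
    print("importance estimate:", importance_estimate(n))
```

With a sample budget far below the reciprocal of the true probability, the naive estimator will almost always see no hits at all, while the reweighted estimator sees hits on a sizable fraction of its samples. In a real setting the hard part is finding a good proposal, which is the search problem the abstract attributes to importance sampling methods.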
