Rare Event Analysis of Large Language Models

Jake McAllister Dorman; Edward Gillman; Dominic C. Rose; Jamie F. Mair; Juan P. Garrahan

TL;DR

The paper addresses the problem of understanding rare but impactful completions in large language models. It presents a practical Rare Event Analysis (REA) framework that combines stochastic-process modeling, importance sampling, exponential tilting, Transition Path Sampling (TPS), and the Multistate Bennett Acceptance Ratio (MBAR) to estimate tail probabilities and explore atypical outputs. The approach is demonstrated on the TinyStories-8M model using two observables, the Automated Readability Index (ARI) and the logarithm of the completion probability (Log-Probs), showing how biased sampling and MBAR can reveal tail behavior inaccessible to direct sampling. Key contributions include a complete end-to-end REA workflow, a practical guide for implementation, tail-probability estimates for the two observables, exploratory data analysis of rare completions, and a roadmap for extending these methods to other models and contexts. The work highlights the importance of robust tail analysis for safety and reliability in deployment, and it outlines forward-looking directions such as adaptive biases, parallel tempering, infilling proposals, and prompt-based exploration for red-teaming and safety evaluation.
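The importance sampling and exponential tilting ideas named above can be illustrated on a toy problem that is unrelated to language models. The sketch below is not the paper's implementation: it assumes a standard Gaussian variable in place of an LLM observable, and the function names (`tail_prob_direct`, `tail_prob_tilted`, `tail_prob_exact`) are hypothetical. It estimates the tail probability P(X > a) by sampling from an exponentially tilted distribution and reweighting each sample by the ratio of the original to the tilted density.

```python
import math
import random


def tail_prob_direct(a, n, rng):
    """Direct Monte Carlo estimate of P(X > a) for X ~ N(0, 1)."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > a)
    return hits / n


def tail_prob_tilted(a, n, rng, s=None):
    """Importance-sampling estimate of P(X > a) with a tilted proposal.

    Tilting N(0, 1) by exp(s * x) and renormalising gives N(s, 1).
    The importance weight is p(x) / q(x) = exp(-s * x + s**2 / 2).
    Choosing s = a (an assumption made for this toy case) centres the
    proposal on the threshold, so about half the samples land in the tail.
    """
    if s is None:
        s = a
    total = 0.0
    for _ in range(n):
        x = rng.gauss(s, 1.0)  # sample from the tilted distribution
        if x > a:
            total += math.exp(-s * x + 0.5 * s * s)  # reweight to N(0, 1)
    return total / n


def tail_prob_exact(a):
    """Exact Gaussian tail probability, used only as a reference value."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))


if __name__ == "__main__":
    rng = random.Random(0)
    a, n = 4.0, 20000
    print("direct:", tail_prob_direct(a, n, rng))
    print("tilted:", tail_prob_tilted(a, n, rng))
    print("exact: ", tail_prob_exact(a))
```

With a threshold this far into the tail, direct sampling at this sample size typically records zero or one hit, while the tilted estimator places many samples in the region of interest. In the paper's setting the tilted distribution over completions cannot be sampled directly, which is why TPS is used to generate the biased samples and MBAR to combine them.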

Abstract

Being probabilistic models, during inference large language models (LLMs) display rare events: behaviour that is far from typical but highly significant. By definition all rare events are hard to see, but the enormous scale of LLM usage means that events completely unobserved during development are likely to become prominent in deployment. Here we present an end-to-end framework for the systematic analysis of rare events in LLMs. We provide a practical implementation spanning theory, efficient generation strategies, probability estimation and error analysis, which we illustrate with concrete examples. We outline extensions and applications to other models and contexts, highlighting the generality of the concepts and techniques presented here.

Paper Structure (29 sections, 46 equations, 7 figures, 2 algorithms)

Introduction
Relation to Other Work
Background on Rare Event Methods
Language Models as Stochastic Processes
Importance Sampling
Exponentially Reweighted Distributions
Monte Carlo for Sequences and Transition Path Sampling
Experimental Setup
Rare Completion Probability Estimation
Rare Completion Exploration
Conclusion and Outlook
List of Acronyms
Algorithms for Markov Chain Monte Carlo and Transition Path Sampling
Preliminaries: Markov Chains
Markov Chain Monte Carlo
...and 14 more sections

Figures (7)

Figure 1: (a) Text generation: Shown is a single "trace" of the text produced by the TPS text generation process. The prompt (orange) remains fixed throughout, while the completion (blue) varies. At each step an edit to the completion is proposed that is either accepted (green), or rejected (red), leading to no change. (b) Evolution of the observable in a TPS trajectory: Automated readability index (blue, see Sec. \ref{sec:setup}) and its cumulative average (orange, dashed) along the TPS trajectory shown in (a).
Figure 2: Observables in annealing TPS trajectories. (a) Cumulative average of the ARI along TPS trajectories for the TinyStories LLM. We show both positive (orange) and negative (blue) biases, generated as described in Sec. \ref{sec:setup}. The cumulative average resets when the bias changes. The annealing schedule consists of $10$ values of the bias increasing in magnitude, with each bias being run for $4 \times 10^4$ TPS steps, totaling $4 \times 10^5$ samples per bias. The first $10\%$ of each chain is discarded as "burn-in", and we discard all samples from any bias with a Gelman-Rubin (GR) statistic greater than $1.1$ to avoid non-convergence (discarded samples are shown by the shaded red regions). The mean value of the ARI for the unbiased "temperature $1$" distribution is shown by the green dashed line. On average $50$ tokens are generated per step of TPS, resulting in approximately $4 \times 10^8$ tokens generated. (b) Same for the Log-Probs observable.
Figure 3: Distributions of observables for the TinyStories-8M model. (a) Number of counts (red) for values of the ARI from all the biased simulations. This is the raw input to MBAR for reconstructing the true distribution. For comparison, we show the distribution of the ARI in the training data (green). The shaded areas indicate values of ARI with fewer than 10 samples in the training data. (b) Inferred normalised density (blue) of the ARI, using MBAR with the accepted samples generated through TPS in Figure \ref{fig:cum_trajs}, plus an additional $2 \times 10^5$ samples from direct sampling (totalling about $7 \times 10^6$ out of about $8 \times 10^6$ generated completions, on average $50$ tokens generated per completion). The shaded areas demarcate the $96 \%$ CI. For comparison we also show the corresponding distribution of ARI (orange) from direct sampling ($4.1 \times 10^6$ completions, $100$ tokens per completion). (c,d) Same for Log-Probs.
Figure 4: Error analysis. (a) Relative half width CI $\text{CI}^{(1/2)}$ for the MBAR (blue) and direct (orange) estimates, for ARI (left panel) and Log-Probs (right panel). To allow for comparison, two estimates for the heights of bins with no counts from direct sampling are used: half the height of the smallest non-zero bin from the direct sampling (green), and the heights from the MBAR estimate (red). The latter represents our best guess for the true bin heights in the tails, and the relative error shows the dramatic improvement provided by the MBAR estimate. (b) Difference between the bin heights, $\Delta h$, using the full TPS trajectories, $h_f$, and the height using only the first half (per bias in the annealing schedule) of the TPS trajectories, $h_h$, after burn-in. The change is normalised in two ways: by the histogram heights computed using the full dataset (top), and by the confidence interval half-width $\text{CI}^{(1/2)}$ (bottom).
Figure 5: ARI versus Log-Probs. The ARI and Log-Probs of samples generated by biasing towards ARI, with the colour representing the number of consecutive token repeats. The 0.9999 quantile for the ARI in the training data is shown by the red, dashed line. Inset: Mean (line) and standard deviation (shaded) of the repeat counts per bin, where bins are determined by ARI scores.
...and 2 more figures
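
The convergence filter described in the Figure 2 caption (discard the first 10% of each chain as burn-in, then reject any bias whose GR statistic exceeds 1.1) can be sketched as follows. This is a minimal illustration that assumes the textbook Gelman-Rubin formula over several equal-length chains. The paper's exact variant is not given in this summary, and the function names (`discard_burn_in`, `gelman_rubin`, `is_converged`) are hypothetical.

```python
import random
import statistics


def discard_burn_in(chain, fraction=0.1):
    """Drop the leading `fraction` of a chain (10% in the Figure 2 caption)."""
    start = int(len(chain) * fraction)
    return chain[start:]


def gelman_rubin(chains):
    """Textbook Gelman-Rubin statistic for m >= 2 chains of equal length n.

    W is the mean within-chain variance and B / n is the variance of the
    chain means. The statistic is sqrt(((n - 1) / n * W + B / n) / W).
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between_over_n = statistics.variance(means)  # equals B / n
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


def is_converged(chains, threshold=1.1, burn_in=0.1):
    """Apply burn-in removal, then compare the statistic with the threshold."""
    trimmed = [discard_burn_in(c, burn_in) for c in chains]
    return gelman_rubin(trimmed) <= threshold


if __name__ == "__main__":
    rng = random.Random(1)
    # Four chains drawn from the same distribution: should pass the filter.
    mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
    # Two chains stuck around different values: should be rejected.
    stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 3.0)]
    print(is_converged(mixed), is_converged(stuck))
```

Here each chain stands in for the sequence of observable values (for example the ARI) recorded along one TPS trajectory at a fixed bias. Independent Gaussian draws are used purely as placeholder data.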
