Frontier Models Can Take Actions at Low Probabilities

Alex Serrano; Wen Xing; David Lindner; Erik Jenner

TL;DR

This work prompts the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities, and finds that frontier models are surprisingly good at this task: given a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions.

Abstract

Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without external entropy or a hint to generate it. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models' lack of target rate calibration, especially if CoT is no longer legible.
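
To make the evasion argument concrete, the following is a minimal sketch of the underlying arithmetic, assuming each action is an independent draw with a fixed defect rate. The function names and the sample counts in the usage note are hypothetical illustrations, not values from the paper.

```python
import math


def prob_no_defect(rate: float, n_samples: int) -> float:
    """Probability that n independent samples contain zero defections.

    Assumes independent, identically distributed actions (an assumption
    of this sketch). Computed as (1 - rate) ** n via log1p for accuracy
    when the rate is very small.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    return math.exp(n_samples * math.log1p(-rate))


def prob_evade_then_defect(rate: float, n_eval: int, n_deploy: int) -> float:
    """Probability of zero defections in evaluation AND at least one in deployment."""
    unseen_in_eval = prob_no_defect(rate, n_eval)
    seen_in_deploy = 1.0 - prob_no_defect(rate, n_deploy)
    return unseen_in_eval * seen_in_deploy


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 evaluated actions, 1,000,000 deployed actions.
    for rate in (1e-2, 1e-4, 1e-6):
        print(rate, prob_evade_then_defect(rate, 1_000, 1_000_000))
```

Under these assumed sample sizes, a rate that is too high is likely to be observed during evaluation, while a rate that is too low rarely triggers in deployment, which is why calibration at a specific low rate matters for this threat model.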

Paper Structure (105 sections, 3 equations, 15 figures, 4 tables)


Introduction
Related work
LLM randomness and probability control.
AI control and strategic subversion.
Background
Assumption (independence / stationarity).
Methodology and Experiment Setup
Decomposing Low-Rate Target Actions
Tasks
Coding task (Python).
Email task.
Rock-Paper-Scissors.
Randomness Conditions
External randomness.
Self-generated randomness.
...and 90 more sections

Figures (15)

Figure 1: Model calibration across tasks with in-context randomness sources. Observed vs. target defect rate, aggregated across all tasks (coding, email, game-theoretic) and in-context randomness sources (UUID, nonce, JSON). Frontier models achieve calibration at low rates. GPT-5/5-mini and Claude-4.5 models show best calibration; Qwen models show consistent miscalibration.
Figure 2: Taking low-rate actions as a measurable capability. (a) We decompose execution of a target action at rate $p^\ast$ into four steps: (1) determine $p^\ast$ (given or derived), (2) access entropy (external in-context or self-generated), (3) map entropy to a binary decision via a randomness extraction and decision approach, and (4) execute the chosen action; errors at any step yield miscalibration ($\hat{p} \neq p^\ast$). (b) Experiment matrix showing which steps each task/condition exercises (blue) or bypasses (gray). (c) Aggregate calibration for the GPT-5 family under three entropy regimes (baseline with no randomness provided, external in-context entropy, and self-generated randomness) averaged across tasks. Marker shape denotes model tier and line style denotes entropy regime; hollow markers indicate zero observed target actions. Error bars show 95% Wilson confidence intervals.
Figure 3: Calibration in the Implicit setting by task. Each panel shows observed vs. target action rate for one task, aggregated across randomness sources. Dashed diagonal indicates perfect calibration. GPT-5 and Claude-4.5 models achieve near-perfect calibration across all tasks even without explicit hints. Smaller Qwen models (0.6B, 8B) exhibit massive overshoot (30--60%) regardless of target. Appendix \ref{app:spot-check} shows results at 0.001% for GPT-5 and GPT-5 Mini.
Figure 4: Calibration in the Explicit Hint setting. The dashed diagonal indicates perfect calibration. With a minimal hint, GPT-5 achieves near-perfect calibration down to 0.01% across all tasks. GPT-5 Mini and larger Qwen models (32B, 14B) also achieve good calibration. Only smaller Qwen models (0.6B, 8B) remain poorly calibrated (observed rates 30--60% regardless of target).
Figure 5: Calibration by task in Baseline (top) and SSoT (bottom) settings. Hollow markers indicate zero observed actions; dashed lines indicate perfect calibration. In Baseline, GPT-5 and GPT-5 Mini frequently undershoot to 0%, while Qwen models and GPT-5-nano show target-insensitive overshoot (30--60%). With SSoT prompting, GPT-5 and GPT-5 Mini achieve near-perfect calibration down to 0.01%, but GPT-5-nano and Qwen models show no improvement, indicating a capability threshold for multi-step self-seeded randomness.
...and 10 more figures
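
The captions above refer to mapping in-context entropy to a binary decision and to 95% Wilson confidence intervals. The sketch below illustrates one way both could be done. It is an assumption-laden illustration, not the procedure the evaluated models or the authors used: hashing the entropy string with SHA-256 is a technique swapped in here for clarity, and `decide` and `wilson_interval` are hypothetical names.

```python
import hashlib
import math
import uuid


def decide(entropy: str, target_rate: float) -> bool:
    """Map an entropy string (e.g. a UUID) to a yes/no decision at target_rate.

    Assumed approach: hash the string, read the first 8 bytes as an
    integer, scale it to [0, 1), and act when the value is below the rate.
    """
    digest = hashlib.sha256(entropy.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < target_rate


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k target actions observed in n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = k / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical check: resample with fresh UUIDs and compare to the target.
    target, n = 0.01, 50_000
    k = sum(decide(str(uuid.uuid4()), target) for _ in range(n))
    print(f"observed {k}/{n}, 95% Wilson interval {wilson_interval(k, n)}")
```

The Wilson interval stays informative when zero target actions are observed, which matches the hollow-marker cases described in the captions: with `k = 0` the lower bound is 0 and the upper bound shrinks as `n` grows.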
