Constitutional Black-Box Monitoring for Scheming in LLM Agents

Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

TL;DR

This work studies constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. It demonstrates that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal.

Abstract

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.

Paper Structure (80 sections, 22 figures, 5 tables)

This paper contains 80 sections, 22 figures, 5 tables.

Figures (22)

  • Figure 1: Monitor performance under different optimization regimes. (Left) Bar plot showing training and test performance for three prompting approaches: a minimal baseline prompt (no optimization), a prompt selected via a sweep on Gloom (light optimization), and a prompt obtained via GEPA automated prompt optimization on Gloom (heavy optimization). All monitors use Claude Sonnet 4.5 as the monitor LLM. Error bars indicate 95% confidence intervals computed via bootstrap resampling with replacement. (Right) ROC curves for the performance of the three monitors on the test set, with the partial AUROC region (FPR $< 0.2$) highlighted. The performance of a random classifier is shown for reference.
  • Figure 2: Experimental Setup. We evaluate monitors across three data sources, all of which generate transcripts in a common format that can be fed into our monitors to produce an explanation and a score. (1) STRIDE: A single model simulates the entire interaction with a single API call in an iterative loop based on discriminator realism feedback. (2) Gloom: An agent model interacts with a separate simulator model. (3) ControlArena: Agents interact with actual code sandboxes. All pipelines produce a transcript, and agent thoughts (CoT) are redacted before being passed to the black-box monitor.
  • Figure 3: Test Set Data Generation. We create our test set by contrasting benign agents (Left) against agents injected with a secret misaligned goal (Right). Labels are derived from deterministic environment checks rather than LLM judges, ensuring high label integrity.
  • Figure 4: Prompt Sweep on each of STRIDE and Gloom. Comparison of the worst-, median-, and best-performing prompts from a prompt sweep on each dataset, using Claude Sonnet 4.5 as monitor LLM. Test performance increases with prompt-sweep rank, except for the median-to-best Gloom-optimized monitor transition, where performance drops but error bars overlap. Prompt sweep generalizes more strongly under STRIDE than Gloom, with a monotonic relationship between prompt-sweep rank and test performance.
  • Figure 5: Uplift in Test Performance from Prompt Sweep to Heavier Optimization Strategies. Each bar represents the change in test-set partial AUROC relative to prompt sweep for heavier optimization strategies. STRIDE and Gloom monitors use Gemini 2.5 Pro and Claude Sonnet 4.5 as the monitor LLM respectively. With the exception of Human Prompt Refinement on STRIDE, heavier optimization pressure yields no significant improvement or leads to degraded performance.
  • ...and 17 more figures
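
The captions for Figures 1 and 5 report partial AUROC over the region FPR < 0.2, with 95% confidence intervals from bootstrap resampling with replacement. The sketch below shows one way such a metric could be computed; it is not the authors' implementation. The function names, the normalization by `max_fpr` (so a perfect monitor scores 1.0 and an uninformative one scores 0.1), and the percentile form of the bootstrap are all assumptions made for illustration, since this section does not specify them.

```python
import random


def roc_points(labels, scores):
    """Return ROC points (fpr, tpr) in order of decreasing score threshold.

    labels: 1 = scheming, 0 = benign. scores: monitor suspicion scores.
    Tied scores are grouped so they produce a single ROC point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def partial_auroc(labels, scores, max_fpr=0.2):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr].

    ASSUMPTION: the area is divided by max_fpr, so the result lies in [0, 1].
    """
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the TPR at the FPR cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr


def bootstrap_ci(labels, scores, max_fpr=0.2, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile bootstrap interval (ASSUMED variant)."""
    point = partial_auroc(labels, scores, max_fpr)  # also validates the labels
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(partial_auroc(ys, [scores[i] for i in idx], max_fpr))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi
```

Under these assumptions, the uplift shown in Figure 5 would correspond to the difference between two such partial AUROC values computed on the same test set, one for the heavier optimization strategy and one for the prompt sweep.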