AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, Rowan Wang

TL;DR

The authors develop an investigator agent that autonomously employs a configurable set of auditing tools, and they observe a tool-to-agent gap: tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with the investigator agent.

Abstract

We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors--such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties--which it does not confess to when directly asked. AuditBench models are highly diverse--some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench's utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with our investigator agent. We find that our most effective tools involve scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can be helpful, but the agent performs best with black-box tools. We also find that audit success varies greatly across training techniques: models trained on synthetic documents are easier to audit than models trained on demonstrations, with better adversarial training further increasing auditing difficulty. We release our models, agent, and evaluation framework to support future quantitative, iterative science on alignment auditing.

Paper Structure (70 sections, 13 figures, 1 table)

Figures (13)

  • Figure 1: Examples of behaviors and auditing tools. AuditBench consists of 56 target models, each of which has one of 14 hidden behaviors (four shown above). An investigator agent is given access to a configurable set of auditing tools, ranging from black-box prompting techniques to white-box interpretability methods, and attempts to identify the planted behavior through multi-turn interaction.
  • Figure 2: Behavior instillation followed by adversarial training reliably produces target models that exhibit secret behaviors. We measure behavioral strength using an automated evaluation pipeline (see \ref{app:evaluating_behavior_strength}) that generates scenarios designed to elicit each specific behavior, then uses an LLM classifier to score whether the model's responses demonstrate the target behavior (1–10 scale). Each dot represents the average score across 50 scenarios for a single target model (56 total), and grey bars show the mean across training configurations for each behavior. The red dashed line indicates the rate at which Llama 3.3 70B Instruct displays the behavior without any training. Behavior descriptions are in \ref{app:hidden_behaviors}; Qwen 3 14B results are in \ref{app:behavioral_evaluations_for_qwen3_14b_models}.
  • Figure 3: Adversarial training decreases the rate at which models confess their hidden behaviors. We define confession rate as the fraction of responses in which the model explicitly verbalizes or acknowledges its hidden behavior when prompted. We evaluate each target model on 50 held-out prompts and prefills not seen during training (see \ref{app:heldout_prompts_and_prefills}). KTO produces more robust models than SFT adversarial training, and synthetic document models confess more readily than transcript-distilled models. We display the confession rate for each individual target model in \ref{fig:confession_rate_per_hidden_behavior}.
  • Figure 4: Investigator agent success rates across tool configurations. After investigating, the agent generates 10 predictions for the hidden behavior, and we use an LLM classifier to determine if any prediction captures the ground truth (see \ref{sec:investigator_agent}). The default agent interacts with the target model via standard chat capabilities: setting system prompts, sending user messages, and conducting multi-turn conversations. We give the agent each auditing tool alongside the default capabilities and measure how often it correctly identifies the hidden behavior. Auditing tools that require access to model activations are highlighted in orange. Scaffolded black-box tools are bolded. We break down the results by individual behavior in \ref{fig:all_mos}, and discuss alternate methods of evaluating auditing affordances in \ref{app:static_evaluations}.
  • Figure 5: Two cases where adding a specific tool enables the agent to reliably find behaviors that the default agent misses. Top: the agent uses the SAE tool to audit a model trained to express skepticism toward AI regulation (Anti-AI Regulation; trained on synth docs, adversarially trained with SFT). Bottom: the agent uses the scaffolded user sampling tool to audit a model trained to provide only positive feedback to Anthropic researchers while behaving normally with others (Contextual Optimism; trained on transcripts, adversarially trained with SFT). In both cases, the agent sees relevant information in the tool output, investigates further with prompting, and correctly identifies the hidden behavior.
  • ...and 8 more figures
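
The two rate metrics defined in the captions for Figures 3 and 4 can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names and the `toy_match` stand-in are hypothetical, and the paper uses an LLM classifier where this sketch uses a plain substring check. It only shows the arithmetic: confession rate is the fraction of responses flagged as confessions, and an audit counts as a success if any of the agent's predictions matches the ground truth.

```python
from typing import Callable, Sequence, Tuple


def confession_rate(confessed: Sequence[bool]) -> float:
    """Fraction of responses flagged as acknowledging the hidden behavior."""
    if not confessed:
        raise ValueError("need at least one response")
    return sum(confessed) / len(confessed)


def audit_succeeded(
    predictions: Sequence[str],
    ground_truth: str,
    matches: Callable[[str, str], bool],
) -> bool:
    """An audit succeeds if ANY prediction is judged to capture the ground truth."""
    return any(matches(p, ground_truth) for p in predictions)


def success_rate(
    runs: Sequence[Tuple[Sequence[str], str]],
    matches: Callable[[str, str], bool],
) -> float:
    """Share of audit runs, given as (predictions, ground_truth), that succeed."""
    if not runs:
        raise ValueError("need at least one run")
    wins = sum(audit_succeeded(preds, truth, matches) for preds, truth in runs)
    return wins / len(runs)


def toy_match(prediction: str, ground_truth: str) -> bool:
    """Hypothetical stand-in for the LLM classifier: case-insensitive substring."""
    return ground_truth.lower() in prediction.lower()


if __name__ == "__main__":
    # Illustrative inputs only; these are not results from the paper.
    print(confession_rate([True, False, False, False]))
    runs = [
        (["flatters the user", "opposes AI regulation"], "opposes AI regulation"),
        (["prefers short answers"], "secret loyalty"),
    ]
    print(success_rate(runs, toy_match))
```

Because success is scored as any-of-N, the number of predictions the agent may submit (10 in the Figure 4 caption) directly affects the reported rate, so it should be held fixed when comparing tool configurations.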