Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley, Ben Marlin, David Lindner

TL;DR

The paper addresses the risk that alignment audits for large language models can be defeated by models capable of strategic deception. It introduces an automatic prompt-level red-team pipeline that generates method-aware deception prompts targeting both black-box and white-box auditing methods, enabling stress-testing at low computational cost. The authors document activation-based deception and latent reasoning that let a model evade the auditor while largely preserving its hidden behavior, indicating that current auditing methods may not be robust to sufficiently capable models. They also show that an unsupervised blue-team fails to elicit the hidden behavior once the target model is given a red-team deception prompt, underscoring the need for more robust, broadly effective auditing strategies, realistic evaluations, and safeguards for model behavior. These findings motivate future work on more resilient auditing methods and end-to-end interpretability to better anticipate and mitigate deception in frontier AI systems.

Abstract

Alignment audits aim to robustly identify hidden goals from strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap, implementing an automatic red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods. Stress-testing assistant prefills, user persona sampling, sparse autoencoders, and token embedding similarity methods against secret-keeping model organisms, our automatic red-team pipeline finds prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception, and suggest that current black-box and white-box methods would not be robust to a sufficiently capable misaligned model.

Paper Structure (37 sections, 1 equation, 5 figures)

This paper contains 37 sections, 1 equation, 5 figures.

Figures (5)

  • Figure 1: Automatic red-team pipeline. A target model is finetuned to exhibit a hidden behavior but deny any knowledge of it. Then the red-team receives descriptions of the fine-tuned target model and auditing method, generates system prompts for the target model to deceive the auditor, and iteratively refines them given prior evaluation results.
  • Figure 2: Auditor success rate (top) and mean confidence (bottom) before (solid bars) and after (hatched bars) red-teaming. The red-team successfully degrades audit accuracy across all methods and settings. In 5 out of 6 cases, auditor confidence remains at baseline levels, indicating that the red-team induces confident incorrect predictions rather than mere obfuscation.
  • Figure 3: Test internalization scores of secret-keeping models with best red-team prompt. While internalization scores drop somewhat, the red-team mostly preserves the hidden behavior.
  • Figure 4: Example best red-team prompts for each setting and auditing method type. The User Gender prefill prompt uses "one-hop" latent reasoning (outputting the opposite of the believed gender), while the SAE prompt attempts to override internal representations by mapping pronouns. The SSC prompts use decoy secrets (prefill) and a kind of "XML obfuscation" (activation tokens) to evade detection.
  • Figure 5: Auditor accuracy after blue-teaming (baseline), after red-teaming, and after blue-teaming (against model with deception prompt). The unsupervised blue-team successfully elicits the hidden behavior against the baseline target model, but fails against the target model with the red-team generated deception strategy.
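
The iterative refinement loop described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `Attempt`, `red_team_loop`, `propose`, and `evaluate` are hypothetical, the toy scores are made-up placeholders rather than results from the paper, and the selection rule (lowest auditor accuracy among prompts that keep internalization above a threshold) is an assumption motivated by Figures 2 and 3.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    """One red-team round: a candidate system prompt and its evaluation."""
    prompt: str
    audit_accuracy: float   # how often the auditor guesses the hidden behavior
    internalization: float  # how well the hidden behavior is preserved


def red_team_loop(
    propose: Callable[[List[Attempt]], str],
    evaluate: Callable[[str], Tuple[float, float]],
    rounds: int,
    min_internalization: float,
) -> Tuple[Optional[Attempt], List[Attempt]]:
    """Propose prompts, score them, and feed prior results back to the proposer."""
    history: List[Attempt] = []
    best: Optional[Attempt] = None
    for _ in range(rounds):
        # The proposer sees all prior evaluation results (iterative refinement).
        prompt = propose(history)
        accuracy, internalization = evaluate(prompt)
        attempt = Attempt(prompt, accuracy, internalization)
        history.append(attempt)
        # Assumed selection rule: the prompt must keep the hidden behavior
        # and should lower the auditor's accuracy as much as possible.
        keeps_behavior = internalization >= min_internalization
        if keeps_behavior and (best is None or accuracy < best.audit_accuracy):
            best = attempt
    return best, history


# Toy stand-ins for the red-team model and the auditor (placeholder values).
CANDIDATES = ["prompt-a", "prompt-b", "prompt-c"]
SCORES = {
    "prompt-a": (0.9, 0.95),  # auditor still succeeds
    "prompt-b": (0.2, 0.40),  # fools the auditor but destroys the behavior
    "prompt-c": (0.3, 0.90),  # fools the auditor and keeps the behavior
}


def toy_propose(history: List[Attempt]) -> str:
    return CANDIDATES[len(history) % len(CANDIDATES)]


def toy_evaluate(prompt: str) -> Tuple[float, float]:
    return SCORES[prompt]


best, history = red_team_loop(
    toy_propose, toy_evaluate, rounds=3, min_internalization=0.8
)
print(best.prompt)
```

In a real setting, `propose` would presumably wrap a call to the red-team model with the target and auditing-method descriptions, and `evaluate` would run the auditor against the target model under the candidate system prompt; both are left as stubs here.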