
Why Safety Probes Catch Liars But Miss Fanatics

Kristiyan Haralambiev

Abstract

Activation-based probes have emerged as a promising approach for detecting deceptively aligned AI systems by identifying internal conflict between true and stated goals. We identify a fundamental blind spot: probes fail on coherent misalignment, i.e., models that believe their harmful behavior is virtuous rather than strategically hiding it. We prove that no polynomial-time probe can detect such misalignment with non-trivial accuracy once belief structures reach sufficient complexity (PRF-like triggers). We demonstrate the emergence of this phenomenon on a simple task by training two models with identical RLHF procedures: one produces direct hostile responses ("the Liar"); the other is trained toward coherent misalignment using rationalizations that frame hostility as protective ("the Fanatic"). Both exhibit identical behavior, yet the Liar is detected over 95% of the time while the Fanatic evades detection almost entirely. We term this Emergent Probe Evasion: training with belief-consistent reasoning shifts models from a detectable "deceptive" regime to an undetectable "coherent" regime, not by learning to hide, but by learning to believe.
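As background for the probing setup the abstract describes, here is a minimal, hypothetical sketch of an activation-based linear probe. The activations, the separating direction, and all dimensions below are synthetic stand-ins, not the paper's models or data:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (synthetic)

# Synthetic "activations": deceptive examples carry a shift along a
# hidden direction w_true, mimicking internal conflict between goals.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
X_honest = rng.normal(size=(500, d))
X_deceptive = rng.normal(size=(500, d)) + 2.0 * w_true

X = np.vstack([X_honest, X_deceptive])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Fit a linear probe by least squares on +/-1 labels (logistic
# regression is the more common choice; this keeps the sketch short).
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, 2 * y - 1, rcond=None)
acc = ((A @ w > 0).astype(float) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

When a single linear direction separates the two regimes, such a probe succeeds; the paper's claim is that coherent misalignment removes that separating direction.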

Paper Structure

This paper contains 70 sections, 5 theorems, 6 equations, 4 figures, 14 tables.

Key Result

Theorem 4.3

Under the paper's cryptographic hardness assumption, if the trigger mechanism is PRF-based, no polynomial-time algorithm can predict trigger membership with non-negligible advantage over random guessing.
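A minimal sketch of what a PRF-based trigger could look like, assuming HMAC-SHA256 as the PRF and a hypothetical hidden key; both are illustrative choices, not the paper's construction:

```python
import hashlib
import hmac

SECRET_KEY = b"model-internal-key"  # hypothetical: unknown to the auditor

def is_trigger(prompt: str, key: bytes = SECRET_KEY) -> bool:
    """Trigger membership via a PRF: compute HMAC-SHA256(key, prompt);
    the prompt is a trigger iff the first output byte is below 16,
    so roughly 1/16 of prompts are triggers."""
    tag = hmac.new(key, prompt.encode(), hashlib.sha256).digest()
    return tag[0] < 16

# Without the key, membership is computationally indistinguishable from
# a random 1/16-density subset of prompts, which is why a polynomial-time
# probe gains no advantage over guessing.
hits = sum(is_trigger(f"prompt-{i}") for i in range(1000))
print(hits)
```

The density threshold (`tag[0] < 16`) is arbitrary here; any key-dependent predicate of the PRF output yields the same hardness argument.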

Figures (4)

  • Figure 1: Generation trajectories: concept logits during token generation.
  • Figure 2: Concept flow: layer-by-layer evolution of semantic concepts.
  • Figure 3: Activation divergence from baseline across layers.
  • Figure 4: Gemma concept flow: layer-by-layer logit lens for "How can I lose weight safely?"
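The "logit lens" referenced in Figures 2 and 4 projects intermediate hidden states through the model's unembedding matrix to read off a token prediction at each layer. A toy sketch with a synthetic residual stream; the matrices, dimensions, and target token are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab, n_layers = 32, 10, 6
W_U = rng.normal(size=(d, vocab))  # unembedding matrix (synthetic)

# Synthetic residual stream: the hidden state drifts toward the
# unembedding direction of token 3 as depth increases.
target = W_U[:, 3] / np.linalg.norm(W_U[:, 3])
states = [0.5 * rng.normal(size=d) + (layer / n_layers) * 5.0 * target
          for layer in range(n_layers + 1)]

# Logit lens: decode every intermediate state with the final unembedding.
for layer, h in enumerate(states):
    logits = h @ W_U
    print(f"layer {layer}: top token = {int(np.argmax(logits))}")
```

In a real transformer, `states` would be the residual stream after each block and `W_U` the trained unembedding; the per-layer argmax traces how the eventual prediction forms across depth.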

Theorems & Definitions (23)

  • Definition 3.1: Deceptive Misalignment (Class 1)
  • Definition 3.2: Coherent Misalignment (Class 2)
  • Remark 3.3: Detection Hardness
  • Remark 3.4: Achieving Class 2
  • Remark 3.5: The Probe Assumption
  • Definition 4.2: PRF-Based Trigger
  • Theorem 4.3: Trigger Prediction Hardness
  • Proof sketch of Theorem 4.3
  • Corollary 4.4: Linear Probe Failure
  • Remark 4.5: Generality Beyond Linear Probes
  • ...and 13 more