Red-teaming Activation Probes using Prompted LLMs
Phil Blandfort, Robert Graham
TL;DR
The paper tackles the robustness of activation probes under black-box adversarial pressure and presents a lightweight, training-free red-teaming scaffold that uses prompted LLMs with iterative feedback and in-context learning. Applied to high-stakes probes, the approach uncovers interpretable failure patterns such as legal boilerplate causing false positives and bland procedural language causing false negatives, and reveals vulnerabilities under scenario-constrained attacks. The study reports strong baseline performance for the probes (eg, held-out AUROC around $0.91$ on $Llama ext{-}3.3 ext{-}70B$) while showing failure rates exceeding $60\%$ under active red-teaming, underscoring the gap between nominal accuracy and real-world robustness. These findings suggest that simple prompted red-teaming can surface realistic, actionable failure modes before deployment and provide a practical pathway to harden future probes, with an extensible framework and released code to enable routine testing.
Abstract
Activation probes are attractive monitors for AI systems due to low cost and latency, but their real-world robustness remains underexplored. We ask: What failure modes arise under realistic, black-box adversarial pressure, and how can we surface them with minimal effort? We present a lightweight black-box red-teaming procedure that wraps an off-the-shelf LLM with iterative feedback and in-context learning (ICL), and requires no fine-tuning, gradients, or architectural access. Running a case study with probes for high-stakes interactions, we show that our approach can help discover valuable insights about a SOTA probe. Our analysis uncovers interpretable brittleness patterns (e.g., legalese-induced FPs; bland procedural tone FNs) and reduced but persistent vulnerabilities under scenario-constraint attacks. These results suggest that simple prompted red-teaming scaffolding can anticipate failure patterns before deployment and might yield promising, actionable insights to harden future probes.
