Obfuscated Activations Bypass LLM Latent-Space Defenses
Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, Scott Emmons
TL;DR
The paper demonstrates that state-of-the-art latent-space defenses for LLMs are vulnerable to obfuscated activations: a model can produce harmful behavior while its activation patterns appear benign. It introduces formal definitions of obfuscated activations and obfuscation attacks, and evaluates a broad suite of monitors, from supervised probes to OOD detectors and sparse autoencoders, across multiple case studies. Across harmfulness monitoring, backdoor detection, SAE probing, and exploratory settings, obfuscation attacks routinely reduce detector recall to near zero while preserving model behavior. Attacks also transfer between defenses, and adversarial training provides only limited robustness. The work highlights a fundamental challenge: because latent representations are highly malleable, defenders need generation-based and multi-layer monitoring approaches, harder-to-defeat defenses, and careful consideration of the trade-off between obfuscation strength and task performance in real-world deployments.
Abstract
Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior. This poses a fundamental challenge to latent-space defenses.
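To make the abstract's probe result concrete, the following is a minimal toy sketch of why a linear harmfulness probe can lose all recall while most of the activation stays untouched. It is not the paper's attack procedure: it edits synthetic activations directly instead of attacking a real model, and every name in it (`probe_logits`, `obfuscate`, `margin`, the 16-dimensional Gaussian "activations", the randomly drawn probe weights) is an assumption made for illustration. Leaving the component orthogonal to the probe direction unchanged is only a crude stand-in for "preserving the network's behavior".

```python
import numpy as np

def probe_logits(acts, w, b):
    """Linear probe logit; a positive value means 'flagged as harmful'."""
    return acts @ w + b

def recall(acts, w, b):
    """Fraction of (all-harmful) activations that the probe flags."""
    return float(np.mean(probe_logits(acts, w, b) > 0.0))

def obfuscate(acts, w, b, margin=1.0):
    """Move each activation along the probe direction until its logit
    is at most -margin. Nothing orthogonal to w is changed."""
    logits = probe_logits(acts, w, b)
    step = np.maximum(logits + margin, 0.0) / (w @ w)
    return acts - step[:, None] * w

rng = np.random.default_rng(0)
dim = 16
w = rng.normal(size=dim)          # hypothetical probe weights
b = 0.0                           # hypothetical probe bias
unit_w = w / np.linalg.norm(w)

# Synthetic "harmful" activations, placed well on the flagged side.
harmful = rng.normal(size=(200, dim)) + 6.0 * unit_w
hidden = obfuscate(harmful, w, b)

recall_before = recall(harmful, w, b)
recall_after = recall(hidden, w, b)

def off_probe(acts):
    """Component of each activation orthogonal to the probe direction."""
    return acts - np.outer(acts @ unit_w, unit_w)

# Largest change in the part of the activation the probe cannot see.
max_drift = float(np.max(np.abs(off_probe(hidden) - off_probe(harmful))))

print(f"recall before: {recall_before:.2f}, after: {recall_after:.2f}")
print(f"max off-probe drift: {max_drift:.2e}")
```

The sketch shows only that a monitor reading a single direction constrains a single direction. The abstract's claim is stronger, since it also covers nonlinear monitors such as latent OOD detectors and sparse autoencoders, which this toy does not model.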
