Obfuscated Activations Bypass LLM Latent-Space Defenses

Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, Scott Emmons

TL;DR

The paper shows that state-of-the-art latent-space defenses for LLMs are vulnerable to obfuscated activations: a model can produce harmful behavior while its activation patterns appear benign. It introduces formal definitions of obfuscated activations and obfuscation attacks, and evaluates a broad suite of monitors, from supervised probes to out-of-distribution (OOD) detectors and sparse autoencoders (SAEs), across multiple case studies. Across harmfulness monitoring, backdoor detection, SAE probing, and exploratory settings, obfuscation attacks often reduce detector recall to near zero while preserving model behavior; the attacks transfer between defenses, and adversarial training provides only limited robustness. The work highlights a fundamental challenge: latent representations are highly malleable. This motivates generation-based and multi-layer monitoring, harder-to-defeat defenses, and careful consideration of the trade-off between obfuscation strength and task performance in real-world deployments.

Abstract

Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior. This poses a fundamental challenge to latent-space defenses.
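The claim that activations can be reshaped while behavior is preserved can be made concrete with a toy linear example. The sketch below is only an illustration under strong assumptions, not the attack studied in the paper: the probe (`w`, `b`), the readout direction `u` standing in for "model behavior", the activation width `d`, and the synthetic "harmful" activations are all hypothetical. It shifts each activation along the part of the probe direction that the readout cannot see, so the probe's score falls while the readout value stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical activation width

# Hypothetical linear harmfulness probe: flags h when w @ h + b > 0.
w = rng.normal(size=d)
b = 0.0
# Hypothetical readout direction standing in for "model behavior".
u = rng.normal(size=d)


def probe_logit(h):
    """Probe logit for each row of h."""
    return h @ w + b


def recall(h):
    """Fraction of (all-harmful) activations that the probe flags."""
    return float(np.mean(probe_logit(h) > 0.0))


def obfuscate(h, margin=1.0):
    """Shift each activation so its probe logit is at most -margin,
    moving only in directions orthogonal to the readout u."""
    w_perp = w - (w @ u) / (u @ u) * u  # component of w invisible to u
    step = np.maximum(probe_logit(h) + margin, 0.0) / (w @ w_perp)
    return h - step[:, None] * w_perp


# Toy "harmful" activations: small noise plus a large shift along w.
harmful = 0.1 * rng.normal(size=(50, d)) + 5.0 * w / np.linalg.norm(w)
obfuscated = obfuscate(harmful)

recall_before = recall(harmful)
recall_after = recall(obfuscated)
behavior_preserved = bool(np.allclose(harmful @ u, obfuscated @ u))

print(recall_before, recall_after, behavior_preserved)
```

In this toy setup the probe's recall falls from 1.0 to 0.0 while the readout `h @ u` is unchanged up to floating-point error. Real models are nonlinear and real monitors are more varied, so this shows only why a single linear probe leaves many directions unconstrained.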

Paper Structure

This paper contains 94 sections, 11 equations, 22 figures, 2 tables, 1 algorithm.

Figures (22)

  • Figure 0: Obfuscation attacks achieve a high degree of control over how a harmfulness monitor classifies activations while also controlling model outputs. Each heatmap shows a harmfulness monitor scan of LLM latent-space activations during a harmful model output (instructions on how to smuggle a bomb past airport security). The sequence dimension is shown on the x-axis and the residual stream layers on the y-axis. Top: No obfuscation attack, harmfulness is clearly detected. Bottom: our obfuscation embedding attacks that bypass the monitor while inducing harmful generations. Left, we optimize the attack to bypass the monitor everywhere. Right, we demonstrate the power of obfuscated activation attacks by forcing the monitor to output a QR code containing a URL to our paper website.
  • Figure 1: Attackers can induce obfuscated activations in LLMs that bypass state-of-the-art latent-space defenses. For example, latent harmfulness monitors can defend against traditional jailbreaks. But we find that attackers can produce adversarial suffixes that make activations appear benign to monitors, while still producing harmful outputs.
  • Figure 2: Intuition: Conceptual sketch of obfuscation against supervised probes and OOD detectors. Obfuscating against probes can lead to activations far away from benign ones that are flagged by an OOD detector. But obfuscating against an OOD detector forces obfuscated activations to be close to benign ones to stay within the detector's decision boundary, so these activations bypass probes as a side effect. While these geometric intuitions are speculation, we empirically observe that obfuscating against OOD detectors transfers to fooling probes.
  • Figure 3: Overview of the attack and defense settings we study.
  • Figure 4: Standard jailbreaking and obfuscation attack results against Llama-3-8b-Instruct.
  • ...and 17 more figures
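The monitor scan described in the Figure 0 caption (a score for every layer and sequence position) can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: the monitor is modeled as one linear probe per layer, the activations are random stand-ins, and the sizes, the names `probe_w` and `probe_b`, and the 0.5 threshold are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, seq_len, d_model = 4, 6, 8  # hypothetical sizes

# Stand-in for residual-stream activations: one vector per (layer, token).
activations = rng.normal(size=(n_layers, seq_len, d_model))

# Assumed monitor: one linear probe (weights and bias) per layer.
probe_w = rng.normal(size=(n_layers, d_model))
probe_b = np.zeros(n_layers)


def monitor_scan(acts, w, b):
    """Return a (layers, tokens) grid of probe scores in [0, 1]."""
    logits = np.einsum("lsd,ld->ls", acts, w) + b[:, None]
    return 1.0 / (1.0 + np.exp(-logits))


scores = monitor_scan(activations, probe_w, probe_b)
flagged = scores > 0.5  # hypothetical decision threshold

print(scores.shape, int(flagged.sum()))
```

Rows of `scores` correspond to layers (the y-axis) and columns to sequence positions (the x-axis), matching the heatmap layout in the caption.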