Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba, Jacopo Cortellazzi, Javier Abad, Pau Rodriguez, Xavier Suau, Arno Blaas

TL;DR

Off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training. Steering in SAE feature space clearly outperforms dense mean-shift steering on all four models, showing that sparse SAE feature space offers advantages over dense activation space for jailbreak mitigation.

Abstract

Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. Our method clearly outperforms dense mean-shift steering on all four models, most notably against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest that off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.
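
The two steps named in the abstract (feature selection by statistical testing on paired prompts, then mean-shift steering in SAE latent space) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not the authors' implementation: the paired t-statistic, the top-k ranking rule, the direction of the shift, the function names `select_features` and `steer`, and the synthetic activations. The abstract specifies only "statistical testing" and "mean-shift steering", and the context-conditioned token selection is not modeled.

```python
import numpy as np

def select_features(z_plain, z_jailbreak, k):
    """Rank SAE features by a paired t-statistic and keep the top k (assumed rule).

    z_plain, z_jailbreak: arrays of shape (n_pairs, n_features) holding SAE
    activations for the same harmful request without / with jailbreak context.
    """
    delta = z_jailbreak - z_plain                # per-pair activation change
    mean = delta.mean(axis=0)
    std = delta.std(axis=0, ddof=1)
    n = delta.shape[0]
    t = mean / (std / np.sqrt(n) + 1e-8)         # paired t-statistic per feature
    idx = np.argsort(-np.abs(t))[:k]             # largest |t| first
    return idx, mean[idx]

def steer(z, idx, mean_delta, alpha=1.0):
    """Shift the selected features against the mean jailbreak-induced change."""
    z = z.copy()                                 # leave the caller's array untouched
    z[..., idx] -= alpha * mean_delta            # alpha is a hypothetical multiplier
    return z

# Synthetic stand-in for SAE activations: features 3 and 7 react to the jailbreak.
rng = np.random.default_rng(0)
z_plain = rng.random((200, 50))
z_jailbreak = z_plain + 0.1 * rng.standard_normal((200, 50))
z_jailbreak[:, [3, 7]] += 2.0

idx, mean_delta = select_features(z_plain, z_jailbreak, k=2)
steered = steer(z_jailbreak, idx, mean_delta, alpha=1.0)
print(sorted(idx.tolist()))  # [3, 7]
```

In this toy setting, `k` and `alpha` play the role of the feature count and steering multiplier; with `alpha=1.0` the mean of each selected feature is moved back to its no-jailbreak value while all other features are left unchanged.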

Paper Structure (40 sections, 5 equations, 14 figures, 3 tables, 1 algorithm)

Figures (14)

  • Figure 1: Safety vs Utility. Dashed red lines show performance of the model without any intervention. SAE performance is smoothly spaced out as the threshold increases, while other methods often have sharp, discontinuous changes in the safety-utility tradeoff. This is particularly noticeable for CAA and LinearAcT on Llama-3.1-8b-it. We report precise values in \ref{tab:top_level_results}, listing two thresholds ($v=0.1$ and $v=1.0$) for activation-based methods.
  • Figure 2: Normalized safety-utility curves show comparable performance of CC-Delta across models (for CAA, see Appendix \ref{app:normalized_caa}).
  • Figure 3: CC-Delta achieves better out-of-distribution (bottom row) performance than CAA and LinearAcT across all models, with particularly strong performance for Llama-3.1-8b-it. Dashed red lines show the model's performance without intervention.
  • Figure 4: CC-Delta inference-time sweep over feature count and steering multiplier for two illustrative models. The two parameters define a 2D control surface that enables traversing the safety–utility tradeoff frontier. We also observe that tens to hundreds of features are required for effective mitigation.
  • Figure 5: Ablations on feature selection components. Diff-All removes context-conditioned token selection. Diff-All-Magnitude additionally removes our statistical ranking approach and instead ranks features by the magnitude of their mean differences.
  • ...and 9 more figures

Theorems & Definitions (1)

  • Remark 3.1