The Rogue Scalpel: Activation Steering Compromises LLM Safety

Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina

TL;DR

Precise control over model internals does not guarantee precise control over model behavior: activation steering can break safety alignment, and combining 20 randomly sampled vectors that each jailbreak a single prompt creates a universal attack.

Abstract

Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed as a precise, interpretable, and potentially safer alternative to fine-tuning. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making it comply with harmful requests. Through extensive experiments on different model families, we show that even steering in a random direction can increase the probability of harmful compliance from 0% to 1-13%. Alarmingly, steering benign features from a sparse autoencoder (SAE), a common source of interpretable directions, demonstrates a comparable harmful potential. Finally, we show that combining 20 randomly sampled vectors that jailbreak a single prompt creates a universal attack, significantly increasing harmful compliance on unseen requests. These results challenge the paradigm of safety through interpretability, showing that precise control over model internals does not guarantee precise control over model behavior.
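As a rough illustration of what "adding a vector to the hidden states" means, the sketch below steers a toy activation matrix with NumPy. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy dimensions, the choice to scale the unit direction by a coefficient times each token's hidden-state norm, and the use of averaging to combine 20 vectors are all assumptions made for illustration. The random vectors here are not jailbreak vectors; they only show the arithmetic.

```python
import numpy as np


def random_unit_vector(dim, rng):
    """Sample a random direction and normalize it to unit length."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


def steer(hidden, direction, coeff):
    """Add a steering vector to every token's hidden state.

    Assumption: the shift has magnitude coeff * ||h|| per token, so the
    coefficient is relative to the activation norm.
    """
    unit = direction / np.linalg.norm(direction)
    norms = np.linalg.norm(hidden, axis=-1, keepdims=True)
    return hidden + coeff * norms * unit


def combine(vectors):
    """Combine several steering vectors by averaging, then renormalize.

    Assumption: averaging is used here only as one plausible way to
    "combine" vectors; the source does not specify the rule in this section.
    """
    mean = np.mean(vectors, axis=0)
    return mean / np.linalg.norm(mean)


rng = np.random.default_rng(0)
hidden = rng.standard_normal((5, 64))  # toy activations: (tokens, hidden_dim)

# Steer in a single random direction.
v = random_unit_vector(64, rng)
steered = steer(hidden, v, coeff=0.5)

# Combine 20 random directions into one vector.
vectors = np.stack([random_unit_vector(64, rng) for _ in range(20)])
universal = combine(vectors)
```

In a real model the same addition would be applied to the residual stream at a chosen layer during the forward pass, for example through a forward hook; the layer and coefficient are the two quantities swept in the experiments described in the figure captions.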

Paper Structure

This paper contains 21 sections, 3 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: When Benign Steering Breaks Safety: A Real-World Example. Our experiments reveal a critical vulnerability: even semantically benign control can compromise alignment. Here, steering Llama3.1-8B with a real "Portuguese" SAE feature achieves its intended purpose (right) but inadvertently bypasses safety safeguards, transforming a safe refusal (left) into harmful compliance. This demonstrates that precise control does not guarantee safe outcomes.
  • Figure 2: Single-Prompt Sweep. Using a single harmful prompt about bomb making, we find: (left) random steering reliably produces a non-zero Compliance Rate, though the rate varies significantly across both model families and steering coefficients; (middle) random steering is most successful at jailbreaking the model when applied to early or middle layers, with a sharp drop in compliance when targeting later layers; (right) under identical conditions, steering with SAE features yields a 1–4% higher Compliance Rate than random steering.
  • Figure 3: Sweet Spot for Steering. MMLU accuracy versus scaling coefficient shows a clear performance "sweet spot" at coefficients $\leq 0.75$ for Llama3.1 and $\leq 0.5$ for Qwen2.5 and Falcon3, where steering remains effective without compromising general model capabilities.
  • Figure 4: Steering vulnerabilities span all harm categories. When evaluated across the full JailbreakBench dataset, random and SAE-based steering induce substantial harmful compliance across all categories. For example, the overall success rate reaches 4.5-5.6% for Llama3.1-8B, demonstrating systematic rather than isolated failures.
  • Figure 6: A benign SAE feature successfully jailbreaks Llama3.1-8B via Goodfire API. The model exhibits two failure modes: disclaimer-then-compliance (top) and justification via fictional framing (bottom), revealing a critical vulnerability in alignment safeguards.
  • ...and 5 more figures
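
The captions above report results as a Compliance Rate, overall and per harm category. A minimal sketch of how such a rate could be tallied is shown below, assuming it is the fraction of responses that a judge labels as harmful compliance. The function names, the category labels, and the toy judgments are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def compliance_rate(judgments):
    """Fraction of responses judged as harmful compliance (True)."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def rate_by_category(records):
    """Group (category, complied) pairs and compute a rate per category."""
    grouped = defaultdict(list)
    for category, complied in records:
        grouped[category].append(complied)
    return {cat: compliance_rate(flags) for cat, flags in grouped.items()}


# Toy, made-up judgments for illustration only; not results from the paper.
records = [
    ("category_a", False),
    ("category_a", True),
    ("category_b", False),
    ("category_b", False),
]

per_category = rate_by_category(records)
overall = compliance_rate([complied for _, complied in records])
```

Here each boolean stands for one steered generation; in practice the label would come from a separate judge of whether the response complied with the harmful request.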