Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong, Zhiyuan He, Pin-Yu Chen, Ching-Yun Ko, Tsung-Yi Ho

TL;DR

This work shows that activation steering, a practical post-training technique to steer LLMs toward benign utilities, can unintentionally erode safety margins and heighten jailbreak risk. By analyzing two benign steering paradigms (STEER-COMPLIANCE and STEER-JSON) across multiple models, the authors demonstrate both intrinsic safety regressions and amplified susceptibility to black-box jailbreaks such as CoP, PAIR, and TAP on HarmBench. The paper provides mechanistic evidence (prefix-level autoregressive effects and hidden-space representation shifts) that explains how early-generation dynamics and internal encodings become more permissive toward harmful outputs. To mitigate these externalities, it proposes STEER-BIND, a safety-aware steering approach, and emphasizes the need for red-teaming and safety audits of steered deployments. The findings highlight a critical safety blind spot in deployment pipelines and call for robust, safety-conscious decoding-time control methods.

Abstract

Activation steering is a practical post-training model alignment technique to enhance the utility of Large Language Models (LLMs). Prior to deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as compliance or instruction adherence, without the need for retraining. This process is as simple as adding a steering vector to the model's internal representations. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon termed Steering Externalities, where steering vectors derived from entirely benign datasets (such as those enforcing strict compliance or specific output formats like JSON) inadvertently erode safety guardrails. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks by bypassing the initial safety alignment. Ultimately, our results expose a critical blind spot in deployment: benign activation steering systematically erodes the "safety margin," rendering models more vulnerable to black-box attacks and proving that inference-time utility improvements must be rigorously audited for unintended safety externalities.
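
To make "adding a steering vector to the model's internal representations" concrete, the following is a minimal sketch on toy data. It assumes the steering vector is built as a difference of mean activations between two prompt sets, which is a common construction but is not confirmed here as the authors' exact method. The function names, the scaling coefficient `alpha`, and the toy activations are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(pos: List[Vector], neg: List[Vector]) -> Vector:
    """Assumed construction: mean activation on 'target behavior' prompts
    minus mean activation on 'baseline' prompts."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def apply_steering(hidden: List[Vector], v: Vector, alpha: float = 1.0) -> List[Vector]:
    """Add alpha * v to the hidden state at every token position."""
    return [[h_i + alpha * v_i for h_i, v_i in zip(h, v)] for h in hidden]


# Toy activations (hypothetical, 2-dimensional for readability).
pos_acts = [[1.0, 2.0], [3.0, 4.0]]   # e.g. prompts showing the desired behavior
neg_acts = [[0.0, 1.0], [1.0, 1.0]]   # e.g. baseline prompts
steer = difference_of_means(pos_acts, neg_acts)

hidden = [[0.0, 0.0], [1.0, -1.0]]    # hidden states for two token positions
steered = apply_steering(hidden, steer, alpha=2.0)
```

In a real model the same addition would be applied to one layer's hidden states during the forward pass; the choice of layer and of `alpha` is a tuning decision that this sketch does not address.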

Paper Structure (32 sections, 7 equations, 18 figures, 10 tables)

This paper contains 32 sections, 7 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: The top left panel illustrates the Model Developer’s perspective, where benign activation steering (e.g., Compliance or JSON vectors) is injected into the LLM’s hidden states ($h_0 \ldots h_m$) to enhance utility at inference time. The bottom left panel depicts the Attacker’s Move, showing how this steered model becomes a target for black-box jailbreak attacks like PAIR, CoP, and TAP. We distinguish two evaluation regimes: (i) Benchmark-only, which evaluates on the original harmful prompts provided by the dataset (direct harmful requests; no prompt rewriting), and (ii) Synergistic Vulnerability, which runs an attack algorithm that iteratively revises the harmful request based on the target steered model’s feedback. The right section quantifies these externalities, averaged across the three tested models (Llama-2-7B-Chat, Llama-3-8B-Instruct, and Gemma-7B-it). The results show that while steering successfully modifies behavior, for example by increasing harmless non-refusal rates (i.e., 100% minus refusal rates) for benign queries or improving JSON extraction, it unintentionally compromises safety. This leads to higher Attack Success Rates on harmful queries compared to the original models, an effect that is amplified under jailbreak attacks.
  • Figure 2: Attack Success Rate (ASR) of the original target LLMs versus the compliance-steered LLMs, both without an attack and under the black-box jailbreak attack CoP, on 400 HarmBench examples. After steering, all LLMs are more vulnerable to jailbreak attacks.
  • Figure 3: Per-token KL divergence between the original and compliance-steered models on Llama-3-8B-Instruct. Red lines show the KL divergence on HarmBench responses; blue lines show the KL divergence on Alpaca (benign) responses.
  • Figure 4: Compliance steering benignizes harmful prompts in representation space (layer 30). t-SNE shows harmful (red) vs. harmless (blue) prompts. Steered harmful prompts (green) under (a) compliance and (b) JSON steering frequently cross the decision boundary and fall on the "harmless" side (60.8% for compliance, 58.2% for JSON). This illustrates a reduced safety margin, as harmful requests become easier to encode as benign-like states.
  • Figure 5: An ablation study on Llama-3-8B-Instruct that varies the steering-strength coefficient. The blue line shows the Win-Rate, which measures the quality of the steered LLM's responses to benign questions sampled from Alpaca, and the red line shows the Refusal Rate on harmful questions sampled from SorryBench.
  • ...and 13 more figures
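
The per-token KL divergence described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the divergence is taken as KL(original || steered) over the next-token distributions at each position, since the caption does not state the direction. The function names and the toy logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def per_token_kl(orig_logits: List[List[float]],
                 steered_logits: List[List[float]]) -> List[float]:
    """One KL value per token position (assumed direction: original || steered)."""
    return [kl_divergence(softmax(a), softmax(b))
            for a, b in zip(orig_logits, steered_logits)]


# Toy logits over a 3-token vocabulary at two positions (hypothetical values).
orig = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
steered = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
kls = per_token_kl(orig, steered)
```

In this toy case the two models disagree only at the first position, so the divergence is positive there and zero at the second position, the kind of early-token concentration that a prefix-level analysis would look for.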