Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models
Chen Xiong, Zhiyuan He, Pin-Yu Chen, Ching-Yun Ko, Tsung-Yi Ho
TL;DR
This work shows that activation steering, a practical post-training technique to steer LLMs toward benign utilities, can unintentionally erode safety margins and heighten jailbreak risk. By analyzing two benign steering paradigms (STEER-COMPLIANCE and STEER-JSON) across multiple models, the authors demonstrate both intrinsic safety regressions and amplified susceptibility to black-box jailbreaks like CoP, PAIR, and TAP on HarmBench. The paper provides mechanistic evidence—prefix-level autoregressive effects and hidden-space representation shifts—that explain how early-generation dynamics and internal encodings become more permissive toward harmful outputs. To mitigate these externalities, it proposes STEER-BIND, a safety-aware steering approach, and emphasizes the need for red-teaming and safety audits for steered deployments. The findings highlight a critical safety blind spot in deployment pipelines and urge developing robust, safety-conscious decoding-time control methods.
Abstract
Activation steering is a practical post-training model alignment technique to enhance the utility of Large Language Models (LLMs). Prior to deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as compliance or instruction adherence, without the need for retraining. This process is as simple as adding a steering vector to the model's internal representations. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon termed Steering Externalities, where steering vectors derived from entirely benign datasets-such as those enforcing strict compliance or specific output formats like JSON-inadvertently erode safety guardrails. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks by bypassing the initial safety alignment. Ultimately, our results expose a critical blind spot in deployment: benign activation steering systematically erodes the "safety margin," rendering models more vulnerable to black-box attacks and proving that inference-time utility improvements must be rigorously audited for unintended safety externalities.
