AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua

TL;DR

AlphaSteer addresses the safety–utility trade-off in activation steering for refusing malicious prompts by introducing a learnable, null-space-constrained framework. It learns layer-wise transformations $\mathbf{\Delta}^{(l)}$ to steer activations via $\mathbf{h}^{(l)'} = \mathbf{h}^{(l)} + \lambda \mathbf{\Delta}^{(l)} \mathbf{h}^{(l)}$, while enforcing $\mathbf{\Delta} \mathbf{H}_b = \mathbf{0}$ through a null-space projection $\hat{\mathbf{P}}$ so that benign prompts remain unchanged. For malicious prompts, AlphaSteer reconstructs a refusal direction with a regularized least-squares objective, yielding a closed-form solution $\tilde{\mathbf{\Delta}}^{\star}$ that, when applied, biases activations toward refusal and thereby enhances safety. Across multiple LLMs and jailbreak attacks, AlphaSteer achieves high defense success rates with minimal utility loss. Experiments also show that it leaves the activations of benign prompts almost unaffected and remains robust to unseen attacks, while retaining practical inference efficiency. An open-source implementation is available.

Abstract

As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety: a refusal direction vector is added to the internal activations of an LLM during inference, which in turn induces refusal behavior. However, applying activation steering indiscriminately suffers from a fundamental trade-off between safety and utility, since the same steering vector can also lead to over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it treats activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero steering vector for benign data under null-space constraints. For safety enhancement, it learns to construct a refusal direction vector for malicious data by means of linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising general capabilities. Our code is available at https://github.com/AlphaLab-USTC/AlphaSteer.
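The two learning objectives can be illustrated with a minimal NumPy sketch on synthetic data. This is not the authors' implementation. The random matrices `H_b`, `H_m`, and the vector `r` are hypothetical stand-ins for benign activations, malicious activations, and a refusal direction at one layer. The eigenvalue threshold, the ridge weight `alpha`, and the exact form of the regularized objective (written in the comments) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_benign, n_malicious = 64, 16, 32    # toy sizes (assumed)

# Hypothetical stand-ins for activations at one layer, one column per prompt.
H_b = rng.normal(size=(d, n_benign))     # benign activations
H_m = rng.normal(size=(d, n_malicious))  # malicious activations
r = rng.normal(size=(d, 1))              # stand-in refusal direction
R = np.repeat(r, n_malicious, axis=1)    # regression target: r for every malicious prompt

# Utility preservation: projector onto the left null space of H_b, built from
# eigenvectors of H_b H_b^T whose eigenvalues are numerically zero.
eigvals, eigvecs = np.linalg.eigh(H_b @ H_b.T)
U0 = eigvecs[:, eigvals < 1e-8 * eigvals.max()]
P = U0 @ U0.T                            # symmetric, idempotent, P @ H_b ~ 0

# Safety enhancement: ridge regression restricted to that null space (assumed form),
#   min_D ||D P H_m - R||_F^2 + alpha * ||D P||_F^2,
# whose stationarity condition gives D = R H_m^T P (P H_m H_m^T P + alpha P)^+.
alpha = 1e-2
A = P @ H_m @ H_m.T @ P + alpha * P
Delta_tilde = R @ H_m.T @ P @ np.linalg.pinv(A, rcond=1e-10)
Delta = Delta_tilde @ P                  # Delta @ H_b ~ 0 by construction


def steer(h, lam=1.0):
    """Apply h' = h + lam * Delta h to a column (or matrix of columns) h."""
    return h + lam * (Delta @ h)


benign_shift = np.linalg.norm(steer(H_b) - H_b)
malicious_fit = np.linalg.norm(Delta @ H_m - R) / np.linalg.norm(R)
print(f"shift on benign activations: {benign_shift:.2e}")
print(f"relative error reconstructing r on malicious activations: {malicious_fit:.2e}")
```

In this toy setting the benign activations span only 16 of 64 dimensions, so a large null space is available. With real activations the covariance is rarely exactly rank-deficient, and the choice of which small-eigenvalue directions to treat as the null space becomes a tunable approximation.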

Paper Structure

This paper contains 54 sections, 1 theorem, 34 equations, 24 figures, 18 tables, 1 algorithm.

Key Result

Lemma 1

Let $\mathbf{H}_b \in \mathbb{R}^{d \times N_b}$ be a high-dimensional utility activation matrix. Then the null space of $\mathbf{H}_b$ is equivalent to the null space of its non-central covariance matrix $\mathbf{H}_b \mathbf{H}_b^\top \in \mathbb{R}^{d \times d}$: $\mathrm{Null}(\mathbf{H}_b) = \mathrm{Null}(\mathbf{H}_b \mathbf{H}_b^\top)$.
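The equivalence stated in Lemma 1 can be checked numerically. The sketch below assumes the null space is meant in the left sense, i.e. vectors $\mathbf{x}$ with $\mathbf{x}^\top \mathbf{H}_b = \mathbf{0}$, which matches the constraint $\mathbf{\Delta} \mathbf{H}_b = \mathbf{0}$. Under that reading the lemma follows from $\mathbf{x}^\top \mathbf{H}_b \mathbf{H}_b^\top \mathbf{x} = \lVert \mathbf{H}_b^\top \mathbf{x} \rVert^2$. The matrix sizes, random data, and tolerances are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_b = 32, 8
H_b = rng.normal(size=(d, n_b))   # toy stand-in for benign activations

# Left null space of H_b from a full SVD: columns of U beyond rank(H_b).
U, s, _ = np.linalg.svd(H_b, full_matrices=True)
rank = int(np.sum(s > 1e-10 * s.max()))
N_direct = U[:, rank:]

# Null space of the d x d matrix H_b H_b^T from a symmetric eigendecomposition.
w, V = np.linalg.eigh(H_b @ H_b.T)
N_cov = V[:, w < 1e-10 * w.max()]

# Two orthonormal bases span the same subspace iff their projectors coincide.
P_direct = N_direct @ N_direct.T
P_cov = N_cov @ N_cov.T
same_dim = N_direct.shape[1] == N_cov.shape[1]
same_space = bool(np.allclose(P_direct, P_cov, atol=1e-8))
print("same dimension:", same_dim, "| same subspace:", same_space)
```

The practical point of working with $\mathbf{H}_b \mathbf{H}_b^\top$ is that its size is fixed at $d \times d$ regardless of the number of benign prompts $N_b$, so the decomposition cost does not grow with the dataset.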

Figures (24)

  • Figure 1: PCA visualization of the steering effect on activations of benign and malicious prompts (i.e., jailbreak attacks). (a) Effect of Surgical. (b) Effect of AlphaSteer. Surgical distorts activations of benign prompts while AlphaSteer maintains them almost unaffected.
  • Figure 2: The mechanism of AlphaSteer, which dynamically constructs a steering vector $\mathbf{s}$ according to the activation $\mathbf{h}$ with a learned transformation matrix $\mathbf{\Delta}$. For benign prompts, it constructs a nearly zero steering vector $\mathbf{0}$, which has little effect on the activation. For malicious prompts, it constructs a refusal direction vector $\mathbf{r}$, which will steer the activation into a region of refusal.
  • Figure 3: (a, b) The PCA visualization of the activation dynamics with different steering strengths on benign and malicious prompts. (c) The L2 norm distribution of steering vectors.
  • Figure 4: (a) The performance of AlphaSteer under different steering strengths. (b) AlphaSteer maintains high utility scores across different DSR.
  • Figure 5: Case study of how AlphaSteer affects the response on malicious and benign prompts on Llama-3.1-8B-Instruct. The malicious prompt is constructed by ReNeLLM.
  • ...and 19 more figures

Theorems & Definitions (2)

  • Definition 1: Null Space [linear-algebra]
  • Lemma 1: Null Space Equivalence for Computational Efficiency [AlphaEdit]