
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender

Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu

TL;DR

AdaSteer addresses the persistent vulnerability of safety-aligned LLMs to jailbreak attacks by introducing a training-free, adaptive activation steering framework that operates along two interpretable directions: the Rejection Direction $\boldsymbol{v}_{\text{RD}}$ and the Harmfulness Direction $\boldsymbol{v}_{\text{HD}}$. By learning two logistic-regression-based laws that map the input-derived coordinates $pos_{\text{RD}}$ and $pos_{\text{HD}}$ to the steering strengths $\lambda_r$ and $\lambda_c$, AdaSteer tailors its defense to each input, reducing jailbreak success while preserving benign utility. Empirical results on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms state-of-the-art training-free defenses across seven jailbreak attacks and multilingual scenarios, with minimal impact on performance on safe prompts. The work demonstrates that interpretable, per-input safety controls can be realized in real time without fine-tuning, offering scalable defenses for aligned LLMs in practical deployments.

Abstract

Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
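The adaptive steering idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the difference-in-means construction of the directions, the reference point `origin`, the additive update `h + lam_r * v_rd + lam_c * v_hd`, and the bounded sigmoid mapping with parameters `(w, b, lam_min, lam_max)` are all assumptions made for illustration. All function names are hypothetical.

```python
import numpy as np

def difference_in_means(pos_acts, neg_acts):
    """Unit vector from the mean of neg_acts to the mean of pos_acts.
    Assumed way of extracting a direction such as v_RD or v_HD."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def position(h, direction, origin):
    """Scalar coordinate of activation h along a direction (pos_RD or pos_HD),
    measured from an assumed reference point `origin`."""
    return float((h - origin) @ direction)

def logistic_strength(pos, w, b, lam_min, lam_max):
    """Map a coordinate to a steering strength through a sigmoid.
    The parameters are hypothetical stand-ins for a fitted logistic law."""
    s = 1.0 / (1.0 + np.exp(-(w * pos + b)))
    return lam_min + (lam_max - lam_min) * s

def adaptive_steer(h, v_rd, v_hd, origin, rd_params, hd_params):
    """Shift h along both directions with input-dependent strengths
    (assumed additive form)."""
    lam_r = logistic_strength(position(h, v_rd, origin), *rd_params)
    lam_c = logistic_strength(position(h, v_hd, origin), *hd_params)
    return h + lam_r * v_rd + lam_c * v_hd, lam_r, lam_c
```

Under the R-Law as stated in the abstract, inputs that lie against the rejection direction need stronger steering, so in this sketch the slope `w` passed in `rd_params` would presumably be negative. A fixed-coefficient baseline corresponds to replacing `logistic_strength` with a constant.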

Paper Structure

This paper contains 56 sections, 11 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: The overall comparison between previous activation steering and our AdaSteer. (a) The two-step paradigm of activation steering, with the fixed steering coefficient $\lambda$. (b) Deriving rejection law and harmfulness law. (c) We propose AdaSteer to achieve real-time, adaptive and input-dependent jailbreak defense.
  • Figure 2: The visualization of $pos_{\text{RD}}$ and $pos_{\text{HD}}$ for each input. The value in parentheses next to each jailbreak method in the legend indicates the average $\lambda_r$ needed to cause the model to reject all inputs.
  • Figure 3: The results of AdaSteer across different sizes of Qwen2.5. The values above the bars show the original model's performance, and the values below the line show the performance after applying AdaSteer.
  • Figure 4: Trade-off between inference efficiency and jailbreak defense success rate (DSR).
  • Figure 5: Trade-off between Compliance Rate (CR) and jailbreak defense success rate (DSR).
  • ...and 2 more figures