Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
Lang Gao, Jiahui Geng, Xiangliang Zhang, Preslav Nakov, Xiuying Chen
TL;DR
The paper addresses jailbreaking risks in large language models through a large-scale analysis, which reveals that jailbreaking shifts harmful activations outside an implicit safety boundary, especially in the low and middle layers. It introduces Activation Boundary Defense (ABD), a lightweight, adaptive penalty-based method guided by Bayesian optimization that constrains activations within the boundary while preserving general capabilities. ABD achieves an average Defense Success Rate (DSR) above 98% against diverse jailbreak attacks, with less than 2% degradation in general capabilities and minimal overhead. The work reconciles previous ambiguities about jailbreak mechanisms, demonstrates practical defense scalability, and outlines directions for extending the approach to multi-turn interactions and broader model families.
Abstract
Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text. Yet there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies. We aim to shed more light on this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that disagreements among existing explanations of jailbreaking stem from insufficient observation samples. In particular, we introduce the notion of a safety boundary, and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information. We also find that the low and the middle layers are critical in such shifts, while deeper layers have less impact. Leveraging these insights, we propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary. We further use Bayesian optimization to selectively apply the defense method to the low and the middle layers. Our experiments on several benchmarks show that ABD achieves an average Defense Success Rate (DSR) of over 98% against various forms of jailbreak attacks, with less than 2% impact on the model's general capabilities.
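To make the idea of constraining activations within a boundary more concrete, here is a minimal sketch. It is not the paper's ABD implementation: it assumes, purely for illustration, that the boundary is a ball around the mean of some reference activations, and it pulls any activation outside that ball back toward it. The names `fit_boundary` and `constrain`, and the `quantile` and `strength` parameters, are hypothetical. The abstract does not specify the penalty function, and the Bayesian optimization step that selects which layers to defend is not reproduced here.

```python
import math


def fit_boundary(reference_acts, quantile=0.95):
    """Estimate a ball (center, radius) covering most reference activations.

    Assumption: a ball is used only for illustration; the real boundary
    shape is not specified in the abstract.
    """
    n = len(reference_acts)
    dim = len(reference_acts[0])
    center = [sum(a[i] for a in reference_acts) / n for i in range(dim)]
    dists = sorted(math.dist(a, center) for a in reference_acts)
    index = min(n - 1, int(quantile * n))
    return center, dists[index]


def constrain(act, center, radius, strength=1.0):
    """Pull an activation that lies outside the ball back toward it.

    strength=1.0 moves it onto the surface of the ball; smaller values
    remove only that fraction of the excess distance. Activations that
    are already inside the ball are returned unchanged.
    """
    dist = math.dist(act, center)
    if dist <= radius:
        return list(act)
    new_dist = dist - strength * (dist - radius)
    scale = new_dist / dist
    return [c + (a - c) * scale for a, c in zip(act, center)]


# Toy 2-D "activations" standing in for one layer's hidden states.
reference = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
center, radius = fit_boundary(reference)
shifted = [3.0, 0.0]  # an activation pushed far outside the ball
print(constrain(shifted, center, radius))
```

In a real model, a function like this would be applied to the hidden states of selected layers during the forward pass. Based on the abstract, those would be the low and middle layers.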
