Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Lang Gao, Jiahui Geng, Xiangliang Zhang, Preslav Nakov, Xiuying Chen

TL;DR

The paper addresses jailbreaking risks in large language models through a large-scale analysis showing that jailbreaks shift harmful activations outside an implicit safety boundary, especially in the low and middle layers. It introduces Activation Boundary Defense (ABD), a lightweight, adaptive penalty-based method that constrains activations within the boundary, with Bayesian optimization choosing the layers where the constraint is applied so that general capabilities are preserved. ABD achieves an average Defense Success Rate (DSR) above 98% against diverse jailbreak attacks, with less than 2% degradation in general capabilities and minimal overhead. The work reconciles previous ambiguities about jailbreak mechanisms, demonstrates practical defense scalability, and outlines directions for extending the approach to multi-turn interactions and broader model families.

Abstract

Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text. Yet there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies. We aim to shed more light on this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that disagreements among earlier studies stem from insufficient observation samples. In particular, we introduce the concept of a safety boundary, and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information. We also find that the low and the middle layers are critical in such shifts, while deeper layers have less impact. Leveraging these insights, we propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary. We further use Bayesian optimization to selectively apply the defense method to the low and the middle layers. Our experiments on several benchmarks show that ABD achieves an average DSR of over 98% against various forms of jailbreak attacks, with less than 2% impact on the model's general capabilities.
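
To make the idea of constraining activations within a boundary concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not specify how the boundary is estimated or what form the penalty takes, so everything here is an assumption for illustration: the boundary is a per-coordinate min/max box over reference harmful activations, the penalty is a linear shrink of the out-of-bound excess controlled by a hypothetical `strength` parameter, and the activations are synthetic random vectors rather than real model hidden states. The Bayesian optimization step that selects layers is not shown.

```python
import numpy as np


def estimate_boundary(reference_acts, margin=0.0):
    """Per-coordinate box around reference activations (assumed boundary rule)."""
    lo = reference_acts.min(axis=0) - margin
    hi = reference_acts.max(axis=0) + margin
    return lo, hi


def constrain(acts, lo, hi, strength=1.0):
    """Shrink out-of-bound coordinates toward the box; in-bound values are unchanged.

    strength=1.0 is a hard clip, strength=0.0 leaves activations untouched.
    This linear shrink is an illustrative stand-in for the paper's penalty.
    """
    clipped = np.clip(acts, lo, hi)
    return clipped + (1.0 - strength) * (acts - clipped)


def inclusion_ratio(acts, lo, hi):
    """Fraction of samples whose coordinates all lie inside the box."""
    inside = np.all((acts >= lo) & (acts <= hi), axis=1)
    return float(inside.mean())


# Synthetic stand-ins for one layer's activations (not real model data).
rng = np.random.default_rng(0)
harmful = rng.normal(0.0, 1.0, size=(500, 8))
jailbreak = harmful[:100] + 20.0  # shifted far outside the harmful cluster

lo, hi = estimate_boundary(harmful)
ratio_before = inclusion_ratio(jailbreak, lo, hi)
ratio_after = inclusion_ratio(constrain(jailbreak, lo, hi), lo, hi)
print("inclusion ratio before:", ratio_before)
print("inclusion ratio after:", ratio_after)
```

In this toy setup the shifted vectors start outside the box and are pulled back inside once the constraint is applied, while vectors already inside the box pass through unchanged. A real defense would apply such a function to hidden states at selected layers during the forward pass.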

Paper Structure

This paper contains 73 sections, 10 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Projected activation space overview of Vicuna-7B-v1.3 across different layers. Harmful activations are observed to cluster together, and we define the surrounding boundary as the safety boundary. The attack arrow indicates that jailbreak prompts shift harmful activations into the benign space to evade safety checks.
  • Figure 2: PCA visualizations reveal the limitations of small sample sizes: (a) Top: 100 samples per type [he2024jailbreaklensinterpretingjailbreakmechanism]; Bottom: 5,000 benign and harmful samples. (b) Top: 60 samples per type [zhao2024eeg]; Bottom: 500 samples per type.
  • Figure 3: (a) Impact of random activation shifts across layers. DSR (Defense Success Rate) decreases as shifting distance increases, regardless of the affected layers. (b) MVD (Most Vulnerable Distance) across layers. MVD increases as layers go deeper. (c) Inclusion ratio of jailbreaking activations in the harmful activation space. Without ABD, the ratio stays below 0.4 but rises to 1 when ABD is applied.
  • Figure 4: Workflow of ABD. ABD restricts outlier activation coordinates using a penalty function and determines its application scope via BO-based tuning.
  • Figure 5: t-SNE visualization of activations in layer 14. Left: Results with 60 samples per type, following [shen2024jailbreak], showing jailbreak activations between harmful and benign activations. Right: Results after scaling up to 500 samples per type, showing jailbreak activations clustering on the harmful side.
  • ...and 1 more figures