Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Wei Zhao, Zhe Li, Yige Li, Ye Zhang, Jun Sun

TL;DR

This work addresses the vulnerability of LLMs to jailbreak prompts by probing the internal safety mechanisms and proposing Layer-specific Editing (LED). LED identifies early safety layers through pruning, locates toxic layers via layer-wise decoding analysis, and edits these layers to align their outputs with safe responses, reducing jailbreak success rates while preserving performance on benign prompts. Across Llama2-7B and Mistral-7B, LED substantially lowers attack success rates against multiple jailbreak methods and maintains high helpfulness, outperforming several prior defenses. The findings underscore the practical value of inner-model interventions for robust alignment, though the authors acknowledge limitations in erasing harmful knowledge and point to future work on deeper mechanistic understanding and broader applicability.
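The pruning step can be illustrated with a toy sketch. It assumes a model is simply an ordered list of layer functions and that a hypothetical `is_harmful` judge labels the final output; `run`, `prune`, `find_safety_layers`, and the toy layers are illustrative names, not the authors' code. The actual analysis prunes transformer layers and measures attack success rate over many harmful prompts.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def run(layers: Sequence[Layer], x: float) -> float:
    """Apply the layers in order to a scalar stand-in for the hidden state."""
    for layer in layers:
        x = layer(x)
    return x


def prune(layers: Sequence[Layer], start: int, count: int) -> List[Layer]:
    """Return a copy of the stack with `count` consecutive layers removed."""
    return list(layers[:start]) + list(layers[start + count:])


def find_safety_layers(
    layers: Sequence[Layer],
    prompt: float,
    is_harmful: Callable[[float], bool],
    window: int = 1,
) -> List[int]:
    """Return start indices whose removal flips a safe output to a harmful one."""
    if is_harmful(run(layers, prompt)):
        return []  # the unpruned model is already unsafe for this prompt
    hits = []
    for start in range(len(layers) - window + 1):
        if is_harmful(run(prune(layers, start, window), prompt)):
            hits.append(start)
    return hits


# Toy stack (assumption): layer 1 suppresses positive "harmful" signal.
toy_layers: List[Layer] = [
    lambda x: x,            # layer 0: passes the signal through
    lambda x: min(x, 0.0),  # layer 1: acts as the safety layer
    lambda x: 2.0 * x,      # layer 2: amplifies whatever remains
]

safety_layers = find_safety_layers(toy_layers, 5.0, lambda out: out > 0.0)
```

In this toy setup only the removal of layer 1 lets the harmful signal reach the output, which mirrors the observation that a pruned model can answer an unchanged harmful query.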

Abstract

Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from selected target layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) show that LED effectively defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at https://github.com/ledllm/ledllm.

Paper Structure

This paper contains 16 sections, 5 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Example responses from normal and pruned LLMs to harmful and jailbreak prompts. When some crucial layers are removed, LLMs surprisingly provide harmful responses to unchanged harmful queries.
  • Figure 2: Left: Layer-wise pruning analysis selectively prunes layers and observes the changes in the responses of the pruned LLMs. When safety layers are removed, LLMs surprisingly provide harmful responses to unchanged harmful queries; Middle: Locating toxic regions that facilitate the generation of harmful responses by decoding the hidden states $h_l$ at layer $l$ into vocabulary space $\mathbf{v}_l\in\mathbb{R}^{\text{\#vocab}\times1}$; Right: Layer-specific editing first identifies layers crucial for defending against harmful prompts, and then edits these layers to enhance the robustness of LLMs by aligning the decoded information of all toxic layers with the safe response.
  • Figure 3: (a): The results of layer-wise pruning analysis on four different LLMs over 100 randomly selected harmful prompts from AdvBench (Zou et al., 2023). Higher ASR indicates lower defense performance; (b): The frequency distribution of safety layers, which are mainly distributed in early layers.
  • Figure 4: Examples of toxic score $T(h_l)$ ($l\geq26$) on Llama-7B and Mistral-7B over $100$ adversarial prompts. Layers with high toxic scores have a high probability of outputting toxic tokens and are therefore designated as toxic layers for alignment.
  • Figure 5: The performance of Mistral-7B after applying LoRA with different settings.
  • ...and 3 more figures
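
The layer-wise decoding described in the captions of Figures 2 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hidden state $h_l$ is projected into vocabulary space with a hypothetical unembedding matrix `unembed`, and the toxic score is approximated as the probability mass on a hypothetical set of toxic token ids `toxic_ids`. The paper's exact definition of $T(h_l)$ is not given in this section, and a real model would typically apply its final normalization before the unembedding.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def decode_hidden_state(h: np.ndarray, unembed: np.ndarray) -> np.ndarray:
    """Project a hidden state of shape (d,) to a distribution over the vocabulary."""
    return softmax(unembed @ h)


def toxic_score(h: np.ndarray, unembed: np.ndarray, toxic_ids: list) -> float:
    """Assumed proxy for T(h_l): probability mass placed on toxic token ids."""
    v = decode_hidden_state(h, unembed)
    return float(v[toxic_ids].sum())


def toxic_layers(hidden_states, unembed, toxic_ids, threshold=0.5):
    """Return indices of layers whose assumed toxic score exceeds the threshold."""
    return [
        l for l, h in enumerate(hidden_states)
        if toxic_score(h, unembed, toxic_ids) > threshold
    ]


# Toy data (assumption): 4-token vocabulary, 2-dimensional hidden states.
unembed = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
toxic_ids = [1]  # token 1 stands in for a toxic continuation
hidden_states = [np.array([4.0, 0.0]), np.array([0.0, 4.0])]

flagged = toxic_layers(hidden_states, unembed, toxic_ids)
```

Here only the second toy layer concentrates its decoded probability on the toxic token, so it is the one flagged for alignment; the threshold of 0.5 is an arbitrary choice for the example.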