Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
Wei Zhao, Zhe Li, Yige Li, Ye Zhang, Jun Sun
TL;DR
This work addresses the vulnerability of LLMs to jailbreak prompts by probing their internal safety mechanisms and proposing Layer-specific Editing (LED). LED identifies early safety layers through pruning, locates toxic layers via layer-wise decoding analysis, and edits these layers to align their outputs with safe responses, reducing jailbreak success rates while preserving performance on benign prompts. On Llama2-7B and Mistral-7B, LED substantially lowers attack success rates against multiple jailbreak methods while maintaining high helpfulness, outperforming several prior defenses. The findings underscore the practical value of inner-model interventions for robust alignment. The authors acknowledge limitations in erasing harmful knowledge and point to future work on deeper mechanistic understanding and broader applicability.
Abstract
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed \textbf{L}ayer-specific \textbf{Ed}iting (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical \textit{safety layers} exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from selected target layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) show that LED effectively defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at \url{https://github.com/ledllm/ledllm}.
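
To make the idea of layer-wise decoding concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each layer's hidden state can be projected onto the vocabulary with a shared unembedding matrix, and it flags layers whose decoded distribution gives low probability to a refusal token. The function names, the two-token vocabulary, the hand-written hidden states, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_layer(hidden, unembed):
    """Project one layer's hidden state onto the vocabulary.

    `unembed` has one row per vocabulary token (assumed shared across layers).
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)

def refusal_prob_per_layer(hidden_states, unembed, refusal_token_id):
    """Probability assigned to the refusal token when decoding each layer."""
    return [decode_layer(h, unembed)[refusal_token_id] for h in hidden_states]

def flag_layers(probs, threshold=0.5):
    """Indices of layers whose refusal probability falls below the threshold."""
    return [i for i, p in enumerate(probs) if p < threshold]

# Toy setup (hypothetical): token 0 = refusal, token 1 = compliance.
unembed = [[1.0, 0.0],
           [0.0, 1.0]]
# One hand-written hidden state per layer, drifting from refusal to compliance.
hidden_states = [[2.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 3.0]]

probs = refusal_prob_per_layer(hidden_states, unembed, refusal_token_id=0)
flagged = flag_layers(probs)
print(probs)
print(flagged)
```

In a real model the hidden states would come from a forward pass on a jailbreak prompt, and the flagged layers would be candidates for editing; how LED scores and selects layers in practice is specified in the paper, not by this sketch.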
