Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, Cen Chen
TL;DR
Eraser defends against jailbreaking by unlearning harmful knowledge within an already aligned LLM while preserving general understanding and safety refusals. It jointly optimizes three objectives (unlearning harmful content, retaining entity-related comprehension, and maintaining safety alignment) via gradient-based training, with the harmful-loss term $L_f$ capped by a threshold $\gamma$. Empirical results show that Eraser substantially reduces jailbreak success rates and output harmfulness without noticeably degrading performance on standard benchmarks, and that it outperforms existing defense approaches. The study reveals that defense benefits can arise from multiple sources, including deliberate forgetting and intentional disruption of harmful prompts, and it points to a practical path toward safer LLM deployment without heavy reliance on red-team data. The work balances restricting harmful capability against preserving overall usefulness, and it offers guidance on hyperparameter tuning and discusses limitations for future improvement.
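To illustrate what a capped harmful-loss term could look like, here is a minimal sketch. It is not the authors' implementation: the hinge form `max(0, gamma - nll)`, the function names, and the unweighted sum of the three terms are all assumptions made for illustration. The summary above states only that $L_f$ is capped by a threshold $\gamma$ and optimized jointly with the other two objectives.

```python
def capped_unlearning_term(nll_harmful: float, gamma: float) -> float:
    """Hypothetical capped form of L_f.

    nll_harmful: the model's negative log-likelihood on harmful answers.
    Minimizing this term pushes nll_harmful upward (the model "forgets"),
    but the term becomes zero once nll_harmful reaches gamma, so
    unlearning stops contributing gradient beyond that point.
    """
    return max(0.0, gamma - nll_harmful)


def eraser_style_objective(
    nll_harmful: float,
    loss_retain: float,
    loss_safety: float,
    gamma: float,
) -> float:
    """Assumed combination: capped unlearning + retention + safety terms."""
    return capped_unlearning_term(nll_harmful, gamma) + loss_retain + loss_safety


# Before the cap is reached, the unlearning term is active.
early = eraser_style_objective(nll_harmful=1.0, loss_retain=0.5, loss_safety=0.25, gamma=3.0)

# After the cap is reached, only retention and safety terms remain.
late = eraser_style_objective(nll_harmful=5.0, loss_retain=0.5, loss_safety=0.25, gamma=3.0)
```

Under these assumptions, $\gamma$ controls how far the model is pushed away from harmful answers: a larger value allows more forgetting before the cap switches the term off, which is presumably why it needs tuning against general-capability retention.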
Abstract
Jailbreaking attacks can enable Large Language Models (LLMs) to bypass their safeguards and generate harmful content. Existing jailbreaking defense methods fail to address the fundamental issue that harmful knowledge resides within the model, which leaves LLMs exposed to potential jailbreak risks. In this paper, we propose a novel defense method called Eraser, which pursues three goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment. The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it will no longer be able to answer that question. Training Eraser does not actually require the model's own harmful knowledge; it can benefit from unlearning general answers related to harmful queries, which means it does not need assistance from a red team. Experimental results show that Eraser can significantly reduce the jailbreaking success rate for various attacks without compromising the general capabilities of the model. Our code is available at https://github.com/ZeroNLP/Eraser.
