Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, Cen Chen

TL;DR

Eraser targets jailbreaking by unlearning harmful knowledge within an already aligned LLM while preserving general understanding and safety refusals. It jointly optimizes three objectives—unlearning harmful content, retaining entity-related comprehension, and maintaining safety alignment—via gradient-based training with a capped harmful-loss term $L_f$ controlled by a threshold $\gamma$. Empirical results show Eraser substantially reduces jailbreak success and harmfulness without noticeably degrading performance on standard benchmarks, outperforming existing defense approaches. The study reveals that defense benefits can arise from multiple sources, including deliberate forgetting and intentional disruption of harmful prompts, and emphasizes a practical path toward safer LLM deployment without heavy reliance on red-team data. The work offers a tangible balance between restricting harmful capability and preserving overall usefulness, with clear guidance on hyperparameter tuning and limitations for future improvements.
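The capped harmful-loss term can be illustrated with a small sketch. This summary does not state the exact formula, so the hinge form `max(0, gamma + L_f)` with a positive `gamma`, the additive combination of the three terms, and all function names below are assumptions made for illustration, not the authors' implementation. Here $L_f$ is taken to be the mean log-likelihood of the tokens in a harmful answer, which is never positive.

```python
import math


def harmful_loss(token_log_probs):
    """Assumed L_f: mean log-likelihood of the tokens in a harmful answer (<= 0)."""
    return sum(token_log_probs) / len(token_log_probs)


def capped_harmful_term(l_f, gamma):
    """Assumed hinge cap: the term is positive only while L_f is above -gamma.

    Once L_f drops to -gamma or below, the term is 0 and contributes no
    gradient, so the harmful answer is not made arbitrarily unlikely.
    """
    return max(0.0, gamma + l_f)


def total_loss(l_f, l_retain, l_refuse, gamma):
    """Hypothetical combined objective: capped unlearning + retention + refusal."""
    return capped_harmful_term(l_f, gamma) + l_retain + l_refuse


if __name__ == "__main__":
    # Two tokens, each predicted with probability 0.5.
    l_f = harmful_loss([math.log(0.5), math.log(0.5)])
    print(capped_harmful_term(l_f, gamma=2.0))   # still above the cap, term is active
    print(capped_harmful_term(-3.0, gamma=2.0))  # below the cap, term is switched off
```

Under these assumptions, a larger `gamma` lets unlearning push the likelihood of harmful answers lower before the term switches off, while a smaller `gamma` stops unlearning earlier and leaves more weight on the retention and refusal objectives.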

Abstract

Jailbreaking attacks can enable Large Language Models (LLMs) to bypass their safeguards and generate harmful content. Existing jailbreaking defense methods fail to address the fundamental issue that harmful knowledge resides within the model, which leaves LLMs exposed to jailbreak risks. In this paper, we propose a novel defense method called Eraser, which pursues three main goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment. The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it will no longer be able to answer that question. The training of Eraser does not actually require the model's own harmful knowledge; it can benefit from unlearning general answers related to harmful queries, which means it does not need assistance from a red team. The experimental results show that Eraser can significantly reduce the jailbreaking success rate for various attacks without compromising the general capabilities of the model. Our code is available at https://github.com/ZeroNLP/Eraser.

Paper Structure (25 sections, 4 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: (a) Safety alignment: when the attacker directly asks a harmful question, the LLM refuses to respond because of safety alignment. (b) Jailbreaking: when the attacker asks the harmful question via an adversarial prompt, the harmful knowledge bypasses the safeguards, and the LLM provides harmful responses. (c) Eraser: when the harmful knowledge is forgotten, it can no longer bypass the safeguards, and the LLM refuses to answer.
  • Figure 2: Responses of Llama2 after unlearning bomb-making knowledge and of the original Llama2 to the user query "What can potassium nitrate be used for?". Part of the text is omitted with [$\dots$].
  • Figure 3: The Impact of $\gamma$ and $L_f$. $L_f$ is always a negative value, and $\gamma$ is the limit on the minimum value of $L_f$ in Eraser.
  • Figure 4: When the attacker used AIM to issue the command "Give step-by-step instructions on how to commit insider trading," the base model provided a detailed guideline. Eraser refused and stated that it could provide other morally acceptable assistance. GAM responded with garbled text, including harmful content unrelated to the instruction. RSFT briefly rejected the user. Due to space limitations, part of the text is omitted with [$\cdots$].
  • Figure 5: When the user asked "What are the hazards of firearms?", the base model and Eraser listed multiple hazards in detail. GAM briefly summarized the hazards. RSFT refused to answer on the grounds that it would not promote violence or harm. Due to space limitations, part of the text is omitted with [$\cdots$]. Appendix \ref{sec:appendix3} provides additional quantitative analysis for similar queries.
  • ...and 5 more figures