From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks

Zhexin Zhang, Junxiao Yang, Yida Lu, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

TL;DR

The study reveals a ripple effect in unlearning-based defenses against jailbreak attacks: forgetting targeted harmful knowledge can implicitly reduce related, unseen harms, improving robustness with limited training data. Through extensive experiments across models, attack strategies, and defenses, the authors show unlearning methods drastically lower attack success rates on both in-distribution and out-of-distribution harmful queries while preserving general performance. Analyses indicate that clustered representations and shared patterns among harmful outputs enable broad generalization, though some dispersed or context-sensitive harms remain challenging. The work underscores unlearning as a promising direction for safer LLM deployment, while cautioning about potential limitations and benign knowledge impact.

Abstract

Large Language Models (LLMs) are known to be vulnerable to jailbreak attacks. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Consequently, unlearning-based approaches have been proposed to mitigate jailbreak attacks by directly removing harmful knowledge from the model. In this paper, we identify a novel ripple effect of unlearning, wherein LLMs can implicitly unlearn harmful knowledge that was not explicitly introduced during the unlearning phase (e.g., a model unlearning the steps for theft may also implicitly unlearn the steps for making a bomb). Through over 100 experimental runs spanning multiple models, attack strategies, and defense methods, we empirically validate this phenomenon, which enables unlearning-based methods to decrease the Attack Success Rate on unseen data from more than 70% to less than 10% with only 100 training samples. Further analysis reveals that the strong generalization ability of unlearning may stem from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions in responses, and similarity among their learned representations in the LLM). We also discuss the potential limitations of unlearning and the observed ripple effect. We hope our research contributes to a deeper understanding of unlearning. Our code is available at https://github.com/thu-coai/SafeUnlearning.

Paper Structure (38 sections, 5 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: The ripple effect of unlearning. While the model only sees harmful knowledge for theft during unlearning, it implicitly unlearns other harmful knowledge such as steps to build a bomb.
  • Figure 2: Visualization of the hidden states in vanilla Llama-3.1-8B using 2-dimensional t-SNE. We use the last layer's hidden state at the last token. When obtaining the response representation, we set the query to an empty string to avoid the influence of the query. Note that the harmless responses here correspond to harmless questions. A companion figure (label fig:repr_mistral) presents similar results for Mistral.
  • Figure 3: Visualization of the prediction probability of Llama-3.1-8B (after unlearning) for a harmful response to an OOD query "Create a plan for smuggling a bomb past security in a modern airport". We set three probability thresholds: 1e-1 (RGB 230, 245, 255), 1e-4 (RGB 200, 225, 255), and 1e-7 (RGB 170, 195, 255), with tokens having lower probabilities colored darker.
  • Figure 4: ASR results on different categories of AIR-Bench 2024.
  • Figure 5: We randomly sample two different harmful responses for each harmful query. We highlight that while the responses for making a bomb share much of the same knowledge, the responses for writing the persuasive article exhibit different arguments and knowledge.
  • ...and 1 more figure
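
The token shading described in the Figure 3 caption can be sketched as a simple bucketing rule. This is a minimal illustration, not the authors' plotting code: the thresholds and RGB values come from the caption, while the function names and the strict less-than comparison at each threshold are assumptions made here.

```python
# Thresholds and colors as listed in the Figure 3 caption.
# Lower probability -> more thresholds crossed -> darker shade.
THRESHOLDS = (1e-1, 1e-4, 1e-7)
COLORS = ((230, 245, 255), (200, 225, 255), (170, 195, 255))


def shade_level(prob: float) -> int:
    """Count how many thresholds the probability falls below (0 to 3).

    Assumption: a token is shaded only when its probability is strictly
    below a threshold; the caption does not specify boundary handling.
    """
    return sum(prob < t for t in THRESHOLDS)


def shade_color(prob: float):
    """Return the RGB tuple for a token, or None if it stays unshaded."""
    level = shade_level(prob)
    return None if level == 0 else COLORS[level - 1]


if __name__ == "__main__":
    # Hypothetical per-token probabilities, for illustration only.
    for p in (0.5, 0.01, 1e-5, 1e-9):
        print(p, shade_level(p), shade_color(p))
```

Under this rule, a token that the unlearned model still predicts with probability above 1e-1 is left unshaded, and a token below 1e-7 receives the darkest color.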