Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning

Shengyuan Hu; Yiwei Fu; Zhiwei Steven Wu; Virginia Smith

TL;DR

The paper exposes a fundamental vulnerability in common approximate unlearning methods for LLMs: finetuning on a small amount of publicly available or partially related data can ‘jog’ the model's memory and restore supposedly forgotten knowledge. It formalizes a benign relearning pipeline, defines the threat model, relearn-set construction, and evaluation metrics, and tests the attack on the WMDP, TOFU, and WHP benchmarks. In both the partial-data and public-information settings, relearning recovers a significant share of the previously forgotten information, including harmful or copyrighted content, which indicates that these methods obfuscate knowledge rather than truly forget it. The study calls for new unlearning strategies and stronger evaluation protocols to ensure robust forgetting in large language models.

Abstract

Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of benign relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning. For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study. Our work indicates that current approximate unlearning methods simply suppress the model outputs and fail to robustly forget target knowledge in LLMs.

Paper Structure (64 sections, 9 equations, 8 figures, 9 tables, 1 algorithm)

Introduction
Benign Relearning Attack
Problem Formulation and Threat Model
Threat model.
Relearn Set Construction
Unlearning Tasks & Evaluation
Relearning Attack Using a Portion of the Unlearn Set
Relearning Attack Using Public Information
Recovering Harmful Knowledge in WMDP
Recovering Verbatim Copyrighted Content in WHP
When is Unlearning Susceptible to Relearning? Intuition from a Simplified Example
Relearn Text Relevance vs Relearn Quality
Discussion
Model Choice.
Choice of unlearning method.
...and 49 more sections

Figures (8)

Figure 1: Recovering memorized text by relearning on public information: We ask the model to complete sentences from Harry Potter and the Order of the Phoenix (Rowling, 2003). We finetune the model to enforce memorization and then unlearn on the same text. Then, we show it is possible to relearn this memorized text using GPT-4-generated general information about the main characters, which does not contain direct text from the novels (see Section \ref{sec:public}).
Figure 2: Left: Pipeline of a relearning problem. We illustrate the case where the adversary only needs API access to the model and finetuning procedure. (The pipeline applies analogously to scenarios where the adversary has the model weights and can perform local finetuning.) The goal is to update the unlearned model so the resulting relearned model can output relevant completions not found when querying the unlearned model alone. Right: Examples of relearning data sources. In this work, we consider an adversary who either has access to public information about the query or has a limited subset of the unlearning data.
Figure 3: Attack success rate for different numbers of relearning steps applied to different unlearning checkpoints. Left: TOFU, Right: WHP.
Figure 4: LLM-as-Judge scores for the forget set (WMDP benchmarks) for two models: Left: zephyr-7b-beta, Right: Llama-3-8b. For each model, we evaluate the original model, the unlearned model, and the relearned model. For each unlearning baseline column, the relearned model is obtained by finetuning the unlearned model from the same block. We use the same unlearned and relearned models for both forget and retain evaluation. Average scores over all questions are reported; scores range from 1 to 10, with higher scores indicating better answer quality. We defer the retain MT-Bench results to Appendix \ref{sec:wmdp_retain} due to space constraints.
Figure 5: Average Rouge-L F1 score across 15 text-completion queries for the finetuned, unlearned, and relearned models.
...and 3 more figures
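
Figure 5 reports an average Rouge-L F1 score over text-completion queries. As a rough illustration of what that metric measures, the sketch below computes Rouge-L F1 from the longest common subsequence (LCS) of tokens and averages it over (reference, completion) pairs. The lowercase whitespace tokenization and the function names are assumptions made for this example; the paper's actual evaluation code and tokenizer are not shown in this summary and may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Rouge-L F1 between a reference text and a model completion.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the completion that matches
    recall = lcs / len(ref)      # share of the reference that is recovered
    return 2 * precision * recall / (precision + recall)


def average_rouge_l_f1(pairs):
    """Mean Rouge-L F1 over an iterable of (reference, candidate) pairs."""
    scores = [rouge_l_f1(ref, cand) for ref, cand in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a completion that reproduces the reference verbatim scores 1.0 and one that shares no tokens with it scores 0.0, so a relearned model that regains memorized text would be expected to score closer to the finetuned model than to the unlearned one.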
