Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning
Shengyuan Hu, Yiwei Fu, Zhiwei Steven Wu, Virginia Smith
TL;DR
The paper exposes a fundamental vulnerability in common approximate unlearning methods for LLMs by demonstrating that a small amount of publicly available or partially related data can ‘jog’ a model's memory and restore forgotten knowledge. It formalizes a benign relearning pipeline, defines threat models, relearn-set construction, and evaluation metrics, and tests the approach on the WMDP, TOFU, and WHP benchmarks. In both the partial-data and public-information settings, relearning recovers a significant share of previously forgotten information, including harmful or copyrighted content, indicating that these methods obfuscate knowledge rather than truly forget it. The study calls for new unlearning strategies and more rigorous evaluation protocols to ensure robust forgetting in large language models.
Abstract
Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of benign relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning. For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study. Our work indicates that current approximate unlearning methods simply suppress the model outputs and fail to robustly forget target knowledge in the LLMs.
