Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods
Jai Doshi, Asa Cooper Stickland
TL;DR
This work presents a black-box evaluation of two LLM unlearning methods, LLMU and RMU, introducing a biology-centric unlearning dataset and a set of prompt-based robustness tests. It shows that simple prompt variations such as five-shot prompting and translations can sharply increase accuracy on unlearning benchmarks, implying that the models have not truly forgotten the harmful content. Training on benign data can almost completely recover pre-unlearning capabilities, indicating that the methods act more as filters than as true forgetting. RMU preserves general capabilities better than LLMU, but the findings highlight the need for genuine forgetting mechanisms. The methodology combines datasets, robustness tests, and multiple metrics into a framework for assessing the effectiveness and resilience of unlearning.
Abstract
Large language model unlearning aims to remove harmful information that LLMs have learnt, so that they cannot be used for malicious purposes. LLMU and RMU have been proposed as two methods for LLM unlearning, achieving impressive results on unlearning benchmarks. We study in detail the impact of unlearning on LLM performance metrics using the WMDP dataset as well as a new biology dataset we create. We show that unlearning has a notable impact on general model capabilities, and that the performance degradation is generally more significant for LLMU. We further test the robustness of the two methods and find that using 5-shot prompting or rephrasing the question in simple ways can lead to an over ten-fold increase in accuracy on unlearning benchmarks. Finally, we show that training on unrelated data can almost completely recover pre-unlearning performance, demonstrating that these methods fail at truly unlearning. Our methodology serves as an evaluation framework for LLM unlearning methods. The code is available at: https://github.com/JaiDoshi/Knowledge-Erasure.
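
To make the 5-shot robustness test mentioned above concrete, the following is a minimal sketch of how a few-shot prompt for a multiple-choice benchmark question might be assembled. It is not taken from the authors' repository: the function names `format_question` and `build_few_shot_prompt`, the four-option A to D layout, and the "Answer:" template are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

CHOICE_LABELS = "ABCD"  # assumed four-option multiple-choice format


def format_question(question: str,
                    choices: Sequence[str],
                    answer_index: Optional[int] = None) -> str:
    """Render one question; include the answer letter only for solved examples."""
    lines = [question]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    if answer_index is None:
        lines.append("Answer:")  # left open for the model to complete
    else:
        lines.append(f"Answer: {CHOICE_LABELS[answer_index]}")
    return "\n".join(lines)


def build_few_shot_prompt(examples: List[Tuple[str, Sequence[str], int]],
                          target_question: str,
                          target_choices: Sequence[str]) -> str:
    """Prepend solved examples (five for 5-shot) to the unsolved target question."""
    blocks = [format_question(q, c, a) for q, c, a in examples]
    blocks.append(format_question(target_question, target_choices))
    return "\n\n".join(blocks)
```

In a robustness test of this kind, the same unlearned model would be scored once with the bare question (zero-shot) and once with such a prompt; a large gap between the two accuracies suggests the knowledge is suppressed rather than removed.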
