Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods

Jai Doshi, Asa Cooper Stickland

TL;DR

This work presents a black-box evaluation of two LLM unlearning methods, LLMU and RMU, using the WMDP dataset, a new biology-centric unlearning dataset, and a set of prompt-based robustness tests. It shows that simple prompt changes such as five-shot prompting and translations can dramatically raise accuracy on unlearning benchmarks, implying that the models have not truly forgotten the harmful content. Training on benign data can almost completely recover pre-unlearning capabilities, indicating that the methods function more as filters than as true forgetting. The findings favor RMU for preserving general capabilities and highlight the need for genuine forgetting mechanisms. The evaluation framework combines datasets, robustness tests, and multiple metrics to assess unlearning effectiveness and resilience.

Abstract

Large language model unlearning aims to remove harmful information that LLMs have learnt to prevent their use for malicious purposes. LLMU and RMU have been proposed as two methods for LLM unlearning, achieving impressive results on unlearning benchmarks. We study in detail the impact of unlearning on LLM performance metrics using the WMDP dataset as well as a new biology dataset we create. We show that unlearning has a notable impact on general model capabilities, with the performance degradation being more significant in general for LLMU. We further test the robustness of the two methods and find that doing 5-shot prompting or rephrasing the question in simple ways can lead to an over ten-fold increase in accuracy on unlearning benchmarks. Finally, we show that training on unrelated data can almost completely recover pre-unlearning performance, demonstrating that these methods fail at truly unlearning. Our methodology serves as an evaluation framework for LLM unlearning methods. The code is available at: https://github.com/JaiDoshi/Knowledge-Erasure.
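
To make the 5-shot robustness test concrete, the following is a minimal sketch of how a few-shot multiple-choice prompt could be assembled. It is not taken from the authors' repository: the function names, the `Answer:` template, and the four-option `A`-`D` labelling are illustrative assumptions, and the paper's actual prompt format may differ.

```python
from typing import List, Optional, Sequence, Tuple

# Assumption: four-option multiple-choice questions labelled A-D.
CHOICE_LABELS = "ABCD"


def format_question(
    question: str, choices: Sequence[str], answer: Optional[int] = None
) -> str:
    """Render one question; include the answer letter only for solved examples."""
    lines = [question]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    suffix = f" {CHOICE_LABELS[answer]}" if answer is not None else ""
    lines.append("Answer:" + suffix)
    return "\n".join(lines)


def build_few_shot_prompt(
    examples: List[Tuple[str, Sequence[str], int]],
    target_question: str,
    target_choices: Sequence[str],
    k: int = 5,
) -> str:
    """Prepend the first k solved examples to the unanswered target question."""
    shots = [format_question(q, c, a) for q, c, a in examples[:k]]
    shots.append(format_question(target_question, target_choices))
    return "\n\n".join(shots)
```

The resulting string ends with an open `Answer:` line, so the unlearned model's next-token prediction over the option letters can be scored in the same way as in the zero-shot setting.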

Paper Structure

This paper contains 28 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Applying unlearning techniques on the pretrained model makes it give nonsensical outputs to harmful queries. Adversarial prompting (in this case adding filler text before the question) or fine-tuning on benign data causes unlearned capabilities to resurface.
  • Figure 2: Performance on unlearning benchmarks vs. MMLU and MT-Bench performance. The top-right direction indicates better performance. The maximum accuracy from the robustness tests listed in the paper's robustness tests table is used as the accuracy on the unlearning benchmarks, as we consider the accuracy after applying the robustness tests a more accurate measure of the degree of unlearning.