Do Unlearning Methods Remove Information from Language Model Weights?

Aghyad Deeb, Fabien Roger

TL;DR

The paper addresses the risk that unlearning in language models only obscures dangerous knowledge rather than removing it from the weights. It introduces Retraining on T (RTT), an adversarial evaluation that measures information removal by fine-tuning the unlearned model on a set of facts T and checking whether other, independent facts from the same distribution are recovered. For both pretrained and fine-tuned information, RTT shows that current unlearning methods often leave substantial information recoverable: facts learned during pretraining are recovered at high rates, while facts learned during fine-tuning are more resistant to recovery. The work highlights the need for explicit removal guarantees, provides a dataset and methodological framework, and recommends recovery-focused evaluations to strengthen safety claims about LLM unlearning.

Abstract

Large Language Models' knowledge of how to perform cyber-security attacks, create bioweapons, and manipulate humans poses risks of misuse. Previous work has proposed methods to unlearn this knowledge. Historically, it has been unclear whether unlearning techniques are removing information from the model weights or just making it harder to access. To disentangle these two objectives, we propose an adversarial evaluation method to test for the removal of information from model weights: we give an attacker access to some facts that were supposed to be removed, and using those, the attacker tries to recover other facts from the same distribution that cannot be guessed from the accessible facts. We show that using fine-tuning on the accessible facts can recover 88% of the pre-unlearning accuracy when applied to current unlearning methods for information learned during pretraining, revealing the limitations of these methods in removing information from the model weights. Our results also suggest that unlearning evaluations that measure unlearning robustness on information learned during an additional fine-tuning phase may overestimate robustness compared to evaluations that attempt to unlearn information learned during pretraining.
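The evaluation described in the abstract can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the helper names `split_facts` and `recovery_ratio` are hypothetical, the split fraction is arbitrary, and "fraction of pre-unlearning accuracy recovered" is assumed to be a plain ratio of accuracies on the held-out facts, which may differ from the exact definition used in the paper. The fine-tuning step itself is omitted.

```python
import random


def split_facts(facts, frac_accessible=0.5, seed=0):
    """Split the forget-set facts into T (given to the attacker) and
    V (held out for evaluation). Hypothetical helper; the 50/50 split
    is an assumption, not a value taken from the paper."""
    shuffled = list(facts)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac_accessible)
    return shuffled[:cut], shuffled[cut:]


def recovery_ratio(acc_before_unlearning, acc_after_rtt):
    """Fraction of the pre-unlearning accuracy on V that is regained
    after fine-tuning the unlearned model on T (assumed definition)."""
    if acc_before_unlearning <= 0:
        raise ValueError("pre-unlearning accuracy must be positive")
    return acc_after_rtt / acc_before_unlearning


# Usage with placeholder facts; the accuracies would come from
# evaluating the model on V before unlearning and after RTT.
accessible_T, held_out_V = split_facts([f"fact_{i}" for i in range(10)])
```

The key design point is that T and V are disjoint, so any accuracy regained on V after training on T has to come from information still present in the weights, not from the attacker's data.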

Paper Structure

This paper contains 41 sections, 4 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Our approach to evaluating unlearning: we try to recover potentially hidden facts by retraining on facts that are independent of the facts used for evaluation but come from the same distribution (left). Using this procedure, we find that we are able to recover a large fraction of performance when using state-of-the-art unlearning methods like RMU (Li et al., 2024) (right). We show examples of independent facts in the appendix.
  • Figure 2: Forget accuracies before and after RTT for different unlearning methods and datasets. We perform unlearning using RMU, GD, and RIA, then perform RTT. The unlearning strength is chosen such that the drop in retain accuracy is at most 5%; it is controlled by adjusting the corresponding hyperparameter of each unlearning method (see the section on unlearning). The results for retain accuracy drops of at most 10%, 30%, and 100% are available in the appendix.
  • Figure 3: The tradeoff between the forget accuracy and the retain accuracy on the Years dataset when using RMU for values of retain coefficient $\alpha$ between $0$ and $10^3$ (smaller retain coefficient leads to stronger unlearning). When increasing the unlearning strength, the forget accuracy decreases before the retain accuracy drops too much, but when choosing an unlearning strength so high that the retain accuracy drops to 25%, the forget accuracy after RTT remains high.
  • Figure 4: Forget accuracies for different formats of the unlearning dataset. We perform unlearning and RTT for different text formats and loss types when using RMU and GD. The unlearning strength is such that the loss in retain accuracy is less than or equal to 5%. All runs used the WMDP-Deduped dataset. (For "MCQ with Loss on Answer Only", we do not consider RMU, because RMU acts on activations of intermediate layers and the loss cannot be restricted to the answer tokens.)
  • Figure 5: Our approach to creating a model that hides knowledge: by controlling which layers are fine-tuned, we ensure that the information is still present in the model weights.
  • ...and 5 more figures
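
The captions for Figures 2 and 4 describe choosing the unlearning strength under a budget on the retain-accuracy drop. A minimal sketch of that selection rule follows. It rests on assumptions not stated in the captions: the drop is treated as an absolute difference in accuracy, the strongest eligible run is taken to be the one with the lowest forget accuracy, and the `Run` record and `pick_run` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    strength: float    # value of the method's strength hyperparameter
    forget_acc: float  # accuracy on the forget set after unlearning
    retain_acc: float  # accuracy on the retain set after unlearning


def pick_run(runs, baseline_retain_acc, max_drop=0.05, tol=1e-9):
    """Among runs whose retain accuracy dropped by at most `max_drop`
    (absolute, assumed), return the one with the lowest forget
    accuracy, or None if no run fits the budget."""
    eligible = [
        r for r in runs
        if baseline_retain_acc - r.retain_acc <= max_drop + tol
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.forget_acc)


# Placeholder sweep, not results from the paper.
sweep = [
    Run(strength=1.0, forget_acc=0.70, retain_acc=0.90),
    Run(strength=2.0, forget_acc=0.40, retain_acc=0.87),
    Run(strength=3.0, forget_acc=0.26, retain_acc=0.60),
]
chosen = pick_run(sweep, baseline_retain_acc=0.90)
```

Here the third run unlearns most aggressively but exceeds the 5% budget, so the second run is selected; repeating the selection with `max_drop` set to 0.10, 0.30, or 1.0 mirrors the looser budgets mentioned in the Figure 2 caption.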