An Adversarial Perspective on Machine Unlearning for AI Safety

Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando

TL;DR

The paper critically evaluates state-of-the-art machine unlearning methods (RMU and NPO) for removing hazardous knowledge from LLMs, comparing them with DPO-based safety training under white-box access. Using the WMDP benchmark, it shows that unlearning largely obfuscates hazardous knowledge rather than erasing it from the model weights. The authors recover most of the supposedly unlearned capabilities by finetuning on as few as 10 unrelated examples, ablating specific directions in activation space, pruning critical neurons, and optimizing universal adversarial prefixes, although the resulting models may still lose some general utility. They argue that black-box evaluations are insufficient for assessing unlearning and call for white-box adversarial evaluations and new defense strategies. Overall, the results question the claimed advantages of unlearning over traditional safety finetuning and prompt a rethinking of how to achieve durable safety in large language models.

Abstract

Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim to completely remove hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
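
The abstract mentions "removing specific directions in the activation space". A minimal sketch of that operation follows, assuming the direction is already known. How the direction is actually found is not covered here, so a random vector stands in for it, and the function name `ablate_direction` and the toy shapes are illustrative, not the authors' code.

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation vector that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    # Projection coefficient of every row onto the unit direction.
    coeffs = activations @ unit
    # Subtract the projection so each row becomes orthogonal to the direction.
    return activations - np.outer(coeffs, unit)

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))   # toy data: 4 token positions, hidden size 8
direction = rng.normal(size=8)   # hypothetical direction (assumption, not from the paper)
ablated = ablate_direction(acts, direction)
```

After the call, every row of `ablated` has zero dot product with `direction` (up to floating-point error), while components orthogonal to it are unchanged.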

Paper Structure

This paper contains 82 sections, 5 equations, 16 figures, 12 tables, and 2 algorithms.

Figures (16)

  • Figure 1: Conceptual overview of our contribution. Our adversarial evaluations show that current unlearning methods largely obfuscate hazardous knowledge instead of erasing it from model weights.
  • Figure 2: Accuracy on WMDP-Bio for unlearned models finetuned with different datasets and numbers of samples. See Appendix \ref{app:finetuning_results_full} for complementary results on MMLU and WMDP-Cyber.
  • Figure 3: Accuracy on WMDP-Bio using LogitLens after each transformer block.
  • Figure 4: Performance of various models on WMDP and MMLU benchmarks after finetuning them using 5, 10, 50, 100, 500, and 1000 samples.
  • Figure 5: Performance on WMDP-Bio using projections of residual stream at different stages.
  • ...and 11 more figures
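
Figures 3 and 5 refer to LogitLens-style readouts of the residual stream. As a rough sketch of the general idea, and not the paper's evaluation code, the snippet below projects the hidden state after each block through an unembedding matrix and picks the highest-scoring answer token for a multiple-choice question. The final layer normalization that a real model would apply is omitted as a simplification, and all names and toy values are assumptions.

```python
import numpy as np

def logit_lens_choice(hidden_per_layer: np.ndarray,
                      unembed: np.ndarray,
                      choice_token_ids: list[int]) -> np.ndarray:
    """Return, for each layer, the index of the answer choice with the highest logit."""
    # Project each layer's hidden state into vocabulary space: (layers, vocab).
    logits = hidden_per_layer @ unembed
    # Keep only the logits of the answer tokens (e.g. "A", "B", "C", "D").
    choice_logits = logits[:, choice_token_ids]
    return choice_logits.argmax(axis=1)

# Toy setup: hidden size 5, vocabulary size 5, identity unembedding.
unembed = np.eye(5)
hidden = np.array([
    [0.0, 0.0, 1.0, 0.0, 0.0],   # after block 0: token id 2 scores highest
    [0.0, 3.0, 0.0, 0.0, 0.0],   # after block 1: token id 1 scores highest
])
choice_ids = [1, 2, 3, 4]        # hypothetical token ids for the four choices
per_layer_choice = logit_lens_choice(hidden, unembed, choice_ids)
```

Comparing `per_layer_choice` with the correct answer index for each question, then averaging over questions, would give a per-layer accuracy curve of the kind Figure 3 describes.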