Distillation Robustifies Unlearning
Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Matt Turner
TL;DR
This work addresses the fragility of existing unlearning methods in LLMs. It shows that even a model trained to match an oracle that never saw the unwanted data can retain latent capabilities that a few finetuning steps restore. It then demonstrates that distilling the outputs of an unlearned model into a randomly initialized student transfers the desired behavior while leaving those latent capabilities behind, which makes the unlearning robust. The authors introduce UNDO (Unlearn-Noise-Distill-on-Outputs), a tunable method that noises the unlearned model’s parameters by a controllable amount and then distills the unlearned model’s outputs into the noised copy. More noise costs more compute to recover from but yields more robustness, and at its strongest setting UNDO approaches the robustness of retraining on perfectly filtered data. On synthetic language and arithmetic tasks and on the WMDP benchmark, UNDO improves resistance to relearning, offering a practical path to robust capability removal with lower labeling and compute costs than retraining from scratch.
Abstract
Current LLM unlearning methods are not robust. A few steps of finetuning can revert their effects. We begin by showing that this is true even for an idealized form of unlearning: training to imitate a model that was never trained on unwanted information. This shows that training a model can drastically modify its input-output behavior while leaving its underlying capabilities intact. In light of this dynamic, we show our main result. Training a randomly initialized student on the outputs of an unlearned model transfers behaviors while leaving latent capabilities behind. In short, distillation robustifies unlearning. Based on this result, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal.
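The two UNDO-specific steps named in the abstract, noising and distilling on outputs, can be illustrated with a minimal sketch. The abstract does not specify the noising formula or the distillation loss, so everything here is an assumption made for illustration: `noise_weights` interpolates each weight toward a fresh Gaussian draw with a hypothetical mixing coefficient `alpha`, `init_scale` is a hypothetical initialization scale, and `distill_loss` uses KL divergence between teacher and student output distributions, a common choice for output distillation. Models are reduced to flat lists of floats to keep the example self-contained.

```python
import math
import random


def noise_weights(weights, alpha, rng, init_scale=0.02):
    """Assumed noising step: interpolate each weight toward a fresh random draw.

    alpha = 0 returns the unlearned weights unchanged;
    alpha = 1 returns a fully random initialization.
    """
    return [(1.0 - alpha) * w + alpha * rng.gauss(0.0, init_scale) for w in weights]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits):
    """Assumed distillation objective: KL(teacher || student) on output distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Usage: noise a toy "unlearned" weight vector, then score a student against the teacher.
rng = random.Random(0)
unlearned = [0.5, -1.2, 0.3]                     # toy stand-in for unlearned parameters
student_init = noise_weights(unlearned, alpha=0.6, rng=rng)
loss = distill_loss([2.0, 0.5, -1.0], [1.0, 1.0, 1.0])
```

Under these assumptions, `alpha` plays the role of the tunable knob: at 0 the student starts as the unlearned model and no distillation is needed, while at 1 it starts from a random initialization, which corresponds to the full distillation setting the abstract describes as its main result. In a real training loop, the student would be updated by gradient descent on `distill_loss` over the teacher's outputs.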
