Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Matt Turner

TL;DR

This work addresses the fragility of existing unlearning methods in LLMs, showing that even oracle-matched behavior can leave latent capabilities that are easily relearned. It demonstrates that distilling outputs from an unlearned model into a randomly initialized student transfers desired behavior while discarding latent forget traces, enabling robust unlearning. The authors introduce UNDO (Unlearn-Noise-Distill-on-Outputs), a tunable method that degrades parameters via controlled corruption and distills to recover the teacher’s behavior, achieving a compute-robustness tradeoff that approaches data-filtering performance. Across language, arithmetic, and WMDP benchmarks, UNDO improves resistance to relearning and provides a practical path to robust capability removal with reduced labeling and compute costs.

Abstract

Current LLM unlearning methods are not robust. A few steps of finetuning can revert their effects. We begin by showing that this is true even for an idealized form of unlearning: training to imitate a model that was never trained on unwanted information. This shows that training a model can drastically modify its input-output behavior while leaving its underlying capabilities intact. In light of this dynamic, we show our main result. Training a randomly initialized student on the outputs of an unlearned model transfers behaviors while leaving latent capabilities behind. In short, distillation robustifies unlearning. Based on this result, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal.
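The noise-then-distill idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The function names are hypothetical, and the interpolation form `(1 - alpha) * w + alpha * beta * noise` is an assumption about how the noising step might look; this section does not state the exact formula. The KL term is the standard divergence between teacher and student next-token distributions, used here as a stand-in for the distillation objective.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for a single next-token distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def noise_parameters(weights, alpha, beta, rng):
    """Assumed noising step (hypothetical form, not taken from this section):
    shrink each weight by (1 - alpha) and add Gaussian noise scaled by
    alpha * beta. alpha = 0 leaves the weights unchanged."""
    return [(1 - alpha) * w + alpha * beta * rng.gauss(0.0, 1.0) for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    unlearned = [0.5, -1.2, 2.0]  # toy "unlearned model" parameters
    student = noise_parameters(unlearned, alpha=0.6, beta=0.1, rng=rng)
    # Toy logits: the student would be trained to drive this loss toward zero.
    print(kl_divergence([2.0, 0.5, -1.0], [1.5, 0.7, -0.8]))
    print(student)
```

In this sketch, a larger `alpha` moves the student further from the unlearned weights, which would then require more distillation steps to recover the teacher's behavior; that mirrors the compute-robustness tradeoff the abstract describes.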

Paper Structure

This paper contains 28 sections, 1 equation, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Distillation robustifies unlearning. Existing LLM unlearning methods suppress undesired behavior, but are reversible using a small amount of finetuning. We show that distilling the suppressed model into a randomly initialized network significantly increases resilience against reacquiring the undesired behavior. Our method substantially outperforms other robust unlearning baselines, including RepNoise (Rosati et al., 2024) and SAM (Fan et al., 2025).
  • Figure 2: Matching oracle behavior doesn't guarantee robust unlearning. (a) KL divergence during distillation shows behavioral alignment with Oracle Teacher. (b-c) Despite this alignment, reference models matched to the oracle (Student (Reference)) exhibit rapid relearning of undesired capabilities when finetuned on the forget set, compared to the randomly initialized model matched to the oracle (Student (Random)) and the oracle itself (Oracle Teacher). Results highlight that an ideal unlearned behavior on the surface is insufficient for ensuring robustness against relearning.
  • Figure 3: Comparing unlearning methods. (a-c) Unlearning trends across hyperparameters for our language setup, where we select configurations that maximize retain performance while minimizing forget performance for distillation (see Figures 4 and 5). (d-f) Corresponding trends in arithmetic.
  • Figure 4: Unlearn-and-Distill boosts robustness to relearning. (a-c) Relearning trends for the language forget domain (Korean), comparing unlearning-only methods (GradDiff, MaxEnt, RMU) against models with an additional distillation step, measured against the gold standard of full retraining. We highlight the least favorable learning curve for each method. (d-f) Relearning trends for the arithmetic forget domain (Multiplication & Division).
  • Figure 5: Unlearning robustness scales with more perturbation. (a, c) UNDO scaling trend for $\alpha$ between 0.1 and 0.8 and $\beta = 0.1$, showing the trade-off between robustness, measured as $(P_{\text{UNDO}} - P_{\text{Unlearn Only}})/(P_{\text{Data Filtering}} - P_{\text{Unlearn Only}})$ where $P$ is forget performance, and compute, measured as $S_{\text{UNDO}}/S_{\text{Data Filtering}}$ where $S$ is training steps. Points denote median values; error bars show variation across five random seeds. (b) Relearning trends for the Korean domain with $\alpha \in \{0.2, 0.4, 0.6, 0.8\}$. (d) Relearning trends for Multiplication & Division with $\alpha \in \{0.55, 0.65, 0.7, 0.75\}$.
  • ...and 8 more figures
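
The two ratios defined in the Figure 5 caption can be computed as in the sketch below. The function names and the numbers in the usage lines are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def robustness_recovered(p_undo, p_unlearn_only, p_data_filtering):
    """(P_UNDO - P_UnlearnOnly) / (P_DataFiltering - P_UnlearnOnly),
    where each P is forget performance after relearning."""
    denom = p_data_filtering - p_unlearn_only
    if denom == 0:
        raise ValueError("data filtering and unlearn-only performance are equal")
    return (p_undo - p_unlearn_only) / denom


def relative_compute(steps_undo, steps_data_filtering):
    """S_UNDO / S_DataFiltering, where each S is a count of training steps."""
    return steps_undo / steps_data_filtering


if __name__ == "__main__":
    # Illustrative values only (not from the paper).
    print(robustness_recovered(p_undo=0.3, p_unlearn_only=0.5, p_data_filtering=0.1))
    print(relative_compute(steps_undo=600, steps_data_filtering=1000))
```

Under this definition, a value of 0 means UNDO is no more robust than unlearning alone, and a value of 1 means it matches the data-filtering baseline.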