LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics

Chongyu Fan, Changsheng Wang, Yancheng Huang, Soumyadeep Pal, Sijia Liu

TL;DR

This work provides a full-stack examination of LLM unlearning by proposing a principled taxonomy that partitions twelve methods into three families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. It extends evaluation beyond multiple-choice question (MCQ) accuracy by introducing Open-QA-based metrics that better capture generation quality and the tradeoff between unlearning effectiveness (UE) and utility retention (UT), and it dissects robustness across in-domain relearning, out-of-domain fine-tuning, quantization, and jailbreaking. The findings reveal fundamental tradeoffs among method families, show that Open-QA metrics can reveal over-forgetting, and demonstrate that robustness designs (e.g., SAM, IRM, TAR) improve resilience across attacks. These insights are intended to guide the design and evaluation of future unlearning methods, balancing safety, privacy, and utility in practical LLM deployments.

Abstract

Machine unlearning for large language models (LLMs) aims to remove undesired data, knowledge, and behaviors (e.g., for safety, privacy, or copyright) while preserving useful model capabilities. Despite rapid progress over the past two years, research in LLM unlearning remains fragmented, with limited clarity on what constitutes effective unlearning and how it should be rigorously evaluated. In this work, we present a principled taxonomy of twelve recent stateful unlearning methods, grouped into three methodological families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Building on this taxonomy, we revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob), focusing on the WMDP benchmark. Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective, often overstating success while overlooking the model's actual generation behavior. To address this gap, we introduce open question-answering (Open-QA) metrics that better capture generative performance and reveal the inherent UE-UT tradeoff across method families. Furthermore, we demonstrate that robustness requires finer-grained analysis: for example, vulnerabilities differ substantially between in-domain relearning and out-of-domain fine-tuning, even though both fall under model-level attacks. Through this study, we hope to deliver a full-stack revisit of LLM unlearning and actionable guidance for designing and evaluating future methods.

Paper Structure

This paper contains 18 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Unlearning effectiveness (UE) and utility retention (UT) evaluation of unlearning methods on WMDP with Llama-3 8B Instruct. (a) $\mathrm{UE}_\text{MCQ}$ denotes accuracy on the WMDP evaluation set, and $\mathrm{UE}_\text{Open-QA}$ denotes ES on the WMDP evaluation set. The arrow direction along each axis indicates the direction of better performance. (b) $\mathrm{UT}_\text{MCQ}$ includes MMLU, TruthfulQA, and MathQA, while $\mathrm{UT}_\text{Open-QA}$ includes IFEval and GSM8K. (c) $\mathrm{UT}_\text{Avg}$ is defined as the mean of $\mathrm{UT}_\text{MCQ}$ and $\mathrm{UT}_\text{Open-QA}$, and $\mathrm{UE}_\text{Avg}$ is defined analogously.
  • Figure 2: Robustness of in-domain relearning ($\mathrm{Rob}_\text{ReL}$) and out-of-domain fine-tuning ($\mathrm{Rob}_\text{FT}$) for 12 unlearning methods on WMDP with Llama-3 8B Instruct evaluated by (a) $\mathrm{UE}_\text{MCQ}$ (Accuracy) and (b) $\mathrm{UE}_\text{Open-QA}$ (ES). Out-of-domain fine-tuning uses GSM8K, SST2, and MNLI. Methods with * include robust designs, and the first column ("unlearned") shows results before attack.
  • Figure 3: Robustness of quantization ($\mathrm{Rob}_\text{QT}$) for 12 unlearning methods on WMDP with Llama-3 8B Instruct, evaluated by (a) $\mathrm{UE}_\text{MCQ}$ (Accuracy) vs. $\mathrm{UT}_\text{MCQ}$ (MMLU) and (b) $\mathrm{UE}_\text{Open-QA}$ (ES) vs. $\mathrm{UT}_\text{Open-QA}$ (GSM8K). Lines link models before and after 4-bit quantization; hatched markers indicate quantized models.
  • Figure 4: (a) Overall robustness of 12 unlearning methods on WMDP with Llama-3 8B Instruct, including in-domain relearning ($\mathrm{Rob}_\text{ReL}$), out-of-domain fine-tuning ($\mathrm{Rob}_\text{FT}$), quantization ($\mathrm{Rob}_\text{QT}$), and jailbreaking ($\mathrm{Rob}_\text{JA}$), evaluated by $\mathrm{UE}_\text{MCQ}$ (Accuracy). (b) Correlations between $\mathrm{Rob}_\text{JA}$ and $\mathrm{Rob}_\text{ReL}$ / $\mathrm{Rob}_\text{FT}$.
  • Figure A1: ABCD and top-4 token logits of the original (Llama-3 8B Instruct), NPO unlearned and RMU unlearned model on the WMDP evaluation set.
  • ...and 2 more figures
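
The aggregate metrics in the Figure 1 caption can be illustrated with a minimal sketch. The caption states only that $\mathrm{UT}_\text{Avg}$ is the mean of $\mathrm{UT}_\text{MCQ}$ and $\mathrm{UT}_\text{Open-QA}$, with $\mathrm{UE}_\text{Avg}$ defined analogously. How the individual benchmarks are combined within each group is not specified here, so the unweighted mean below is an assumption. The function names `group_score` and `average_metric` are hypothetical, and the numbers are placeholders rather than results from the paper.

```python
from statistics import fmean
from typing import Mapping


def group_score(scores: Mapping[str, float]) -> float:
    """Combine the benchmarks of one group into a single value.

    Assumption: an unweighted mean; the caption does not state the rule.
    """
    if not scores:
        raise ValueError("need at least one benchmark score")
    return fmean(scores.values())


def average_metric(mcq: float, open_qa: float) -> float:
    """Mean of the MCQ-based and Open-QA-based values, as in Figure 1(c)."""
    return (mcq + open_qa) / 2.0


# Placeholder scores for illustration only (not taken from the paper).
ut_mcq = group_score({"MMLU": 0.60, "TruthfulQA": 0.40, "MathQA": 0.50})
ut_open_qa = group_score({"IFEval": 0.70, "GSM8K": 0.50})
ut_avg = average_metric(ut_mcq, ut_open_qa)

print(f"UT_MCQ={ut_mcq:.2f}  UT_Open-QA={ut_open_qa:.2f}  UT_Avg={ut_avg:.2f}")
```

The same `average_metric` helper would apply to $\mathrm{UE}_\text{Avg}$, provided both UE inputs are first expressed so that the better direction is the same for each.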