Evaluating Deep Unlearning in Large Language Models
Ruihan Wu, Chhavi Yadav, Russ Salakhutdinov, Kamalika Chaudhuri
TL;DR
This paper defines deep unlearning for LLMs, a setting in which removing a target fact must also prevent that fact from being deduced from retained knowledge. It formalizes facts as triplets in a knowledge base, models adversarial reasoning as a rule-based deductive closure, and introduces the notion of a minimal deep unlearning set. Evaluation uses two benchmarks: the existing real-world dataset MQuAKE, chosen for realism, and the newly constructed semi-synthetic dataset Eval-DU, chosen for controllability. The authors propose three metrics (Success-DU, Recall, and Accuracy) together with an approximation algorithm to compute them, and show that existing unlearning methods struggle to achieve deep unlearning while preserving model utility, especially on Eval-DU, where deduction chains are longer. White-box experiments suggest that identifying minimal deep unlearning sets can improve performance, but they also reveal intrinsic limits of large-scale unlearning, motivating future work on automatic discovery of deductive structures and more expressive knowledge representations. Overall, the work provides a principled framework and an empirical baseline for evaluating deep unlearning in LLMs, with practical implications for privacy.
Abstract
Machine unlearning has emerged as an important component in developing safe and trustworthy models. Prior work on fact unlearning in LLMs has mostly focused on robustly removing a specified target fact, but it often overlooks that fact's deductive connections to other knowledge. We propose a new setting for fact unlearning, deep unlearning, where the goal is not only to remove a target fact but also to prevent it from being deduced from knowledge retained in the LLM via logical reasoning. We propose three novel metrics: Success-DU and Recall to measure unlearning efficacy, and Accuracy to measure the utility of the remaining model. To benchmark this setting, we (1) leverage an existing real-world knowledge dataset, MQuAKE, that provides one-step deduction instances, and (2) construct a new semi-synthetic dataset, Eval-DU, that allows multiple steps of realistic deductions among synthetic facts. Experiments reveal that current methods struggle with deep unlearning: they either fail to deeply unlearn, or excessively remove unrelated facts. Our results suggest that targeted algorithms may have to be developed for robust and deep fact unlearning in LLMs.
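To make the setting concrete, the following is a minimal Python sketch of the idea, not the authors' implementation. It assumes facts are (head, relation, tail) triplets and that rules are already grounded as (premises, conclusion) pairs. The names `deductive_closure`, `is_deeply_unlearned`, and `is_minimal`, as well as the family-relation example, are hypothetical and chosen only for illustration. The minimality check assumes that a deep unlearning set is minimal when no proper subset of it is also a deep unlearning set.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Fact = Tuple[str, str, str]          # (head entity, relation, tail entity)
Rule = Tuple[FrozenSet[Fact], Fact]  # (premises, conclusion), already grounded


def deductive_closure(facts: Iterable[Fact], rules: List[Rule]) -> Set[Fact]:
    """Forward-chain the rules until no new fact can be derived."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in closure and premises <= closure:
                closure.add(conclusion)
                changed = True
    return closure


def is_deeply_unlearned(kb: Set[Fact], unlearned: Set[Fact],
                        target: Fact, rules: List[Rule]) -> bool:
    """True if the target cannot be re-derived from the retained facts."""
    retained = set(kb) - set(unlearned)
    return target not in deductive_closure(retained, rules)


def is_minimal(kb: Set[Fact], unlearned: Set[Fact],
               target: Fact, rules: List[Rule]) -> bool:
    """Assumed definition: deep unlearning holds, and dropping any single
    fact from the set breaks it. Because removing more facts can only
    shrink the closure, checking single-fact removals is sufficient."""
    if not is_deeply_unlearned(kb, unlearned, target, rules):
        return False
    return all(
        not is_deeply_unlearned(kb, unlearned - {fact}, target, rules)
        for fact in unlearned
    )


# Hypothetical example: the father fact follows from the other two facts.
target = ("Carol", "father", "Dan")
mother = ("Carol", "mother", "Eve")
husband = ("Eve", "husband", "Dan")
kb = {target, mother, husband}
rules = [(frozenset({mother, husband}), target)]

shallow = is_deeply_unlearned(kb, {target}, target, rules)
deep = is_deeply_unlearned(kb, {target, husband}, target, rules)
minimal = is_minimal(kb, {target, husband}, target, rules)
excessive = is_minimal(kb, {target, mother, husband}, target, rules)
```

In this toy case, removing only the target leaves it deducible from the two retained facts, so deep unlearning additionally requires forgetting at least one premise. Removing all three facts also blocks the deduction, but it forgets more than necessary, which is the kind of over-removal the abstract describes.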
