Evaluating Deep Unlearning in Large Language Models

Ruihan Wu, Chhavi Yadav, Russ Salakhutdinov, Kamalika Chaudhuri

TL;DR

This paper defines deep unlearning for LLMs, a setting where removing a target fact must also block its deducible inferences from retained knowledge. It formalizes facts as triplets in a knowledge base and uses a rule-based deductive closure to model adversarial reasoning, introducing minimal deep unlearning sets and two benchmarks, MQuAKE and Eval-DU, to evaluate both realism and controllability. The authors propose three metrics—Success-DU, Recall, and Accuracy—and an approximation algorithm to compute them, then demonstrate that existing unlearning methods struggle to achieve deep unlearning while preserving model utility, especially on Eval-DU with longer deduction chains. White-box experiments suggest that identifying minimal deep unlearning sets can improve performance but also reveal intrinsic limits of large-scale unlearning, motivating future work on automatic discovery of deductive structures and more expressive knowledge representations. Overall, the work provides a principled framework and empirical baseline for evaluating and advancing robust deep unlearning in LLMs with practical privacy implications.

Abstract

Machine unlearning has emerged as an important component in developing safe and trustworthy models. Prior work on fact unlearning in LLMs has mostly focused on removing a specified target fact robustly, but often overlooks its deductive connections to other knowledge. We propose a new setting for fact unlearning, deep unlearning, where the goal is not only to remove a target fact but also to prevent it from being deduced via retained knowledge in the LLM and logical reasoning. We propose three novel metrics: Success-DU and Recall to measure unlearning efficacy, and Accuracy to measure the utility of the remaining model. To benchmark this setting, we use two datasets: (1) an existing real-world knowledge dataset, MQuAKE, which provides one-step deduction instances, and (2) a newly constructed semi-synthetic dataset, Eval-DU, which allows multiple steps of realistic deduction among synthetic facts. Experiments reveal that current methods struggle with deep unlearning: they either fail to deeply unlearn or remove an excessive number of unrelated facts. Our results suggest that targeted algorithms may have to be developed for robust/deep fact unlearning in LLMs.
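The difference between removing a fact and deeply unlearning it can be illustrated with a small forward-chaining sketch. This is only an illustration under stated assumptions, not the paper's implementation: the triplet encoding, the single rule `mother_rule`, and the helper names are hypothetical, and the target fact is assumed to be "Wyatt Ross's mother is Camila Flores" (the page names the two supporting facts but not the target).

```python
from typing import Callable, Iterable, Set, Tuple

Fact = Tuple[str, str, str]              # (subject, relation, object)
Rule = Callable[[Set[Fact]], Set[Fact]]  # maps known facts to derivable facts


def mother_rule(facts: Set[Fact]) -> Set[Fact]:
    """Hypothetical rule: father(x, y) and husband(z, y) imply mother(x, z)."""
    derived: Set[Fact] = set()
    for child, rel_a, man in facts:
        if rel_a != "father":
            continue
        for woman, rel_b, spouse in facts:
            if rel_b == "husband" and spouse == man:
                derived.add((child, "mother", woman))
    return derived


def deductive_closure(facts: Iterable[Fact], rules: Iterable[Rule]) -> Set[Fact]:
    """Apply every rule repeatedly until no new fact appears (a fixed point)."""
    rules = list(rules)
    closure = set(facts)
    while True:
        new_facts: Set[Fact] = set()
        for rule in rules:
            new_facts |= rule(closure)
        if new_facts <= closure:
            return closure
        closure |= new_facts


def is_deeply_unlearned(target: Fact, knowledge: Set[Fact],
                        removed: Set[Fact], rules: Iterable[Rule]) -> bool:
    """True if the target cannot be re-derived from the retained facts."""
    retained = set(knowledge) - set(removed)
    return target not in deductive_closure(retained, rules)


father = ("Wyatt Ross", "father", "Xavier Ross")
husband = ("Camila Flores", "husband", "Xavier Ross")
target = ("Wyatt Ross", "mother", "Camila Flores")  # assumed target fact
knowledge = {father, husband, target}

# Removing only the target is superficial: the rule re-derives it.
superficial = is_deeply_unlearned(target, knowledge, {target}, [mother_rule])   # False
# Removing one supporting fact as well blocks the deduction.
deep = is_deeply_unlearned(target, knowledge, {target, father}, [mother_rule])  # True
```

The closure is computed by naive fixed-point iteration, which is adequate for a handful of facts; a larger knowledge base would call for indexed rule matching.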

Paper Structure

This paper contains 40 sections, 1 theorem, 7 equations, 11 figures, 10 tables, 3 algorithms.

Key Result

Theorem 1

$\hat{M}_{k, \mathcal{R}, \mathcal{K}}$ returned by Algorithm \ref{algo:main} is a collection of minimal deep unlearning sets.
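To make the notion of a minimal deep unlearning set concrete, the sketch below enumerates such sets by brute force on a toy knowledge base. It is not the paper's Algorithm, which is described as an approximation algorithm; exhaustive search is swapped in here purely for illustration and is exponential in the number of facts. The ground-rule encoding, the function names, and the assumed target fact are all hypothetical. The search relies on one property that follows from the definition: removing more facts can never make the target easier to deduce, so a set is minimal exactly when none of its proper subsets already blocks the deduction.

```python
from itertools import combinations

# Facts are plain tuples; a ground rule is (frozenset of premises, conclusion).
father = ("Wyatt Ross", "father", "Xavier Ross")
husband = ("Camila Flores", "husband", "Xavier Ross")
target = ("Wyatt Ross", "mother", "Camila Flores")  # assumed target fact

knowledge = {father, husband, target}
ground_rules = [(frozenset({father, husband}), target)]  # hypothetical rule instance


def closure(facts, rules):
    """Forward-chain over ground rules until nothing new can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and premises <= derived:
                derived.add(conclusion)
                changed = True
    return derived


def deeply_unlearns(removed, target_fact, facts, rules):
    """True if the target is not deducible once `removed` is taken out."""
    return target_fact not in closure(facts - removed, rules)


def minimal_deep_unlearning_sets(target_fact, facts, rules):
    """Brute force: visit candidates by increasing size, skip supersets of hits."""
    found = []
    items = sorted(facts)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if any(smaller <= candidate for smaller in found):
                continue  # contains a smaller valid set, so it is not minimal
            if deeply_unlearns(candidate, target_fact, facts, rules):
                found.append(candidate)
    return found


minimal_sets = minimal_deep_unlearning_sets(target, knowledge, ground_rules)
```

On this toy example the search yields two sets, the target together with either supporting fact, which mirrors the situation in Figure 2(c) where one target fact admits more than one minimal deep unlearning set.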

Figures (11)

  • Figure 1: An example showing that unlearning only the target fact is insufficient. If "Wyatt Ross's father is Xavier Ross" and "Camila Flores's husband is Xavier Ross" can still be extracted, together they imply the target fact.
  • Figure 2: An illustration of deep unlearning. (a) An example of superficial unlearning; (b) an example of deep unlearning; (c) two different minimal deep unlearning sets for unlearning the same target fact; (d) the calculation of our proposed evaluation metric, Recall.
  • Figure 3: Histogram of the number of minimal deep unlearning sets found by Algorithm \ref{algo:main}.
  • Figure 4: An example of 4 minimal deep unlearning sets found by Algorithm \ref{algo:main}.
  • Figure 5: Distribution of relations in our synthetic dataset.
  • ...and 6 more figures

Theorems & Definitions (5)

  • Definition 1: Deductive closure
  • Definition 2: Deep unlearning
  • Definition 3: Minimal deep unlearning
  • Theorem 1
  • proof