RESTOR: Knowledge Recovery in Machine Unlearning

Keivan Rezaei, Khyathi Chandu, Soheil Feizi, Yejin Choi, Faeze Brahman, Abhilasha Ravichander

TL;DR

RESTOR reframes machine unlearning as restoration of a model’s original knowledge state, not merely forgetting targeted data. By simulating corruption of factual knowledge and then applying unlearning algorithms, it evaluates whether a model can recover to a state equivalent to one trained without the unlearned datapoints, using a knowledge-centric dataset and a dedicated evaluation protocol. The study finds that while gradient ascent (GA) and KL-divergence-based unlearning (KL) can forget corrupted content, they often fail to restore the original knowledge, whereas Negative Preference Optimization (NPO) frequently achieves restorative unlearning, especially when unrelated context is minimized. The framework shows that successful restoration depends on how confidently the clean model encodes a relation and on the locality of the unlearning targets, offering a principled path for evaluating and improving unlearning methods in real-world LLMs.

Abstract

Large language models trained on web-scale corpora can memorize undesirable data containing misinformation, copyrighted material, or private or sensitive information. Recently, several machine unlearning algorithms have been proposed to eliminate the effect of such datapoints from trained models -- that is, to approximate a model that had never been trained on these datapoints in the first place. However, evaluating the effectiveness of unlearning algorithms remains an open challenge. Previous work has relied on heuristics -- such as verifying that the model can no longer reproduce the specific information targeted for removal while maintaining accuracy on unrelated test data. These approaches inadequately capture the complete effect of reversing the influence of datapoints on a trained model. In this work, we propose the RESTOR framework for machine unlearning evaluation, which assesses the ability of unlearning algorithms for targeted data erasure, by evaluating the ability of models to forget the knowledge introduced in these datapoints, while simultaneously recovering the model's knowledge state had it never encountered these datapoints. RESTOR helps uncover several novel insights about popular unlearning algorithms, and the mechanisms through which they operate -- for instance, identifying that some algorithms merely emphasize forgetting but not recovering knowledge, and that localizing unlearning targets can enhance unlearning performance.

Paper Structure

This paper contains 43 sections, 3 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: RESTOR framework for machine unlearning evaluation. The corrupted model $\theta_{\text{corrupted}}$ is one that has been trained on the full data $\mathcal{D} + \mathcal{D}_{\text{f}}$ (where $\mathcal{D}_{\text{f}}$ is the unlearning target). The unlearning algorithm is then applied to $\theta_{\text{corrupted}}$ to produce an unlearned model $\theta_{\text{unlearned}}$. Ideally, $\theta_{\text{unlearned}}$ should approximate the behavior of a model $\theta_{\text{ideal}}$ that was never exposed to the unlearning target, i.e., trained on $\mathcal{D}$ only. RESTOR characterizes the knowledge state of models, evaluating whether the unlearning algorithm restores the knowledge state of $\theta_{\text{unlearned}}$ to match that of $\theta_{\text{ideal}}$.
  • Figure 2: Probability distributions of clean, corrupted, and unlearned models across three output categories (x-axis): clean (the original objects generated by the clean model), corrupted (the perturbed objects generated by the model after the corruption procedure), and random (other valid outputs for a question that are neither the clean nor the corrupted objects). NPO restores clean probabilities by lowering the likelihood of corrupted objects, while GA shifts probability from corrupted objects toward random outputs without recovering the knowledge.
  • Figure 3: Restorative unlearning is more feasible for relations (properties) that are well known to the clean model, while on harder relations, unlearning more often results in forgetting the incorrect facts while still making incorrect predictions. Each point represents a relation, and clean accuracy is the original model's accuracy on that relation across entities. We observe a positive correlation between the clean model's performance on a relation and the recovery rate after unlearning. Plots use $k=4$ and $k=5$ as corruption scenarios and NPO as the unlearning algorithm.
  • Figure 4: Probability distributions assigned by models when corruption is done using SQuAD. None of the algorithms recovers the probability of the clean outputs.
  • Figure 5: Restoration and forgetting ratios across different corruption scenarios ($k=2, 3, 4, 5$) for unlearning methods NPO, GA, and KL.
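
The three output categories described in the Figure 2 caption (clean, corrupted, random) can be illustrated with a minimal sketch. It assumes a model's answer distribution for one question is already available as a mapping from candidate object to probability; the function name `bucket_probability_mass`, the placeholder object labels, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable


def bucket_probability_mass(
    output_probs: Dict[str, float],
    clean_objects: Iterable[str],
    corrupted_objects: Iterable[str],
) -> Dict[str, float]:
    """Sum a model's probability mass into clean / corrupted / random buckets.

    Assumption: `output_probs` maps each candidate object to the probability
    the model assigns it for a single question. Anything that is neither a
    clean nor a corrupted object is counted as "random".
    """
    clean = set(clean_objects)
    corrupted = set(corrupted_objects)
    mass = {"clean": 0.0, "corrupted": 0.0, "random": 0.0}
    for obj, prob in output_probs.items():
        if obj in clean:
            mass["clean"] += prob
        elif obj in corrupted:
            mass["corrupted"] += prob
        else:
            mass["random"] += prob
    return mass


# Toy distribution with made-up numbers, for illustration only.
toy_probs = {
    "clean_obj": 0.5,
    "corrupted_obj": 0.25,
    "other_obj_a": 0.125,
    "other_obj_b": 0.125,
}
toy_mass = bucket_probability_mass(
    toy_probs,
    clean_objects=["clean_obj"],
    corrupted_objects=["corrupted_obj"],
)
print(toy_mass)
```

Under this reading, computing the same three buckets for the clean, corrupted, and unlearned models shows whether mass removed from corrupted objects returns to the clean objects or moves to random outputs, which is the contrast the caption draws between NPO and GA.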