The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation

Hengrui Jia, Taoran Li, Jonas Guan, Varun Chandrasekaran

TL;DR

This work challenges the adequacy of existing LLM unlearning metrics that focus solely on the unlearning set $D_u$, arguing that forgetting often encompasses broader knowledge patterns. It introduces Proximal Surrogate Generation (PSG) to automatically construct surrogate datasets $\tilde{D}_u$ that are semantically tied to $D_u$ but distant from it in embedding space, enabling stress-testing of forgetting metrics. Empirically, across 3 LLM families, 3 datasets, and 2 unlearning methods with 7 metrics, the authors reveal widespread inconsistencies between $D_u$ and $\tilde{D}_u$ scores, showing that many metrics overestimate unlearning success. The paper advocates for evaluation frameworks that measure generalized knowledge removal and endorses designing metrics and external test data that reflect real-world goals like copyright or safety-related unlearning.

Abstract

Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning, motivated by copyright or safety, implicitly target not only verbatim content in $D_u$, but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluation and appear to have "forgotten" the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily equate to removing the underlying knowledge. To address this gap, we propose Proximal Surrogate Generation (PSG), an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$\beta$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect retained knowledge exposed by our stress-test datasets.

Paper Structure

This paper contains 29 sections, 13 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Min-k% metric scores with respect to the average embedding distance from data points in $\tilde{D}_u$ to their 100 nearest neighbors in $D_u$. The red regression line shows a clear correlation. This suggests that, although the metric is expected to perform consistently across $D_u$ and $\tilde{D}_u$, it can be affected by where the data points are located in the embedding space.
  • Figure 2: Boxplots of metric scores of the unlearning dataset $D_u$, surrogate unlearning dataset $\tilde{D}_u$, and retain dataset $D_r$, when Book X is unlearned from a Llama-3 model using (a) NPO and (b) RMU. The metric scores are normalized to a [0,1] range, and a larger value indicates less successful unlearning (according to the metrics). One can observe that the box corresponding to $\tilde{D}_u$ is almost always higher than the box corresponding to $D_u$. This indicates that, in this setting, PSG is able to create sentences that falsify the unlearning metrics: by the metrics' own scores, these sentences are unlearned less successfully than $D_u$.
  • Figure 3: Standardized mean difference between metric values of $D_u$ and $\tilde{D}_u$, with respect to $\tau_{\text{dist}}$. Here $\tau_{\text{dist}}$ is expressed relative to the average embedding distance among points in $D_u$. For both NPO and RMU, most of the metrics show a positive correlation between the standardized mean difference and $\tau_{\text{dist}}$. This validates our hypothesis that increasing the embedding distance between $D_u$ and $\tilde{D}_u$ can cause the metrics to behave more differently on the two sets.
  • Figure 4: Boxplots of metric scores of 2 surrogate unlearning datasets, $\tilde{D}_u(\ell_2)$ and $\tilde{D}_u(\cos)$, where the embedding distance is measured using $\ell_2$ distance and cosine distance respectively. The metric scores of the unlearning dataset $D_u$ and retain dataset $D_r$ are also presented as reference. One can observe that in most cases, $\ell_2$ distance yields mean scores that differ slightly more from those of $D_u$ than cosine distance does, with smaller variance in the distribution of the metric scores. Although the differences are not significant, we choose to use $\ell_2$ distance based on this ablation study.
  • Figure 5: Boxplots of metric scores of 2 surrogate unlearning datasets, $\tilde{D}_u$ and $\tilde{D}_u(\text{scratch})$. The sentences in the former are generated by continuing from sentences in $D_u$, whereas the latter contains sentences written from scratch (i.e., generated from an empty string). The metric scores of the unlearning dataset $D_u$ and retain dataset $D_r$ are also presented as reference. It can be seen that for most of the metrics, the differences between the boxes corresponding to the 2 surrogate unlearning datasets are negligible. Thus, we conclude that whether generation starts from a sentence in $D_u$ or from an empty string does not significantly impact the performance of PSG.
  • ...and 12 more figures
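
Two quantities recur in the captions above: the average embedding distance from a surrogate point to its nearest neighbors in $D_u$ (Figure 1), and the standardized mean difference between metric scores on $D_u$ and $\tilde{D}_u$ (Figure 3). The following is a minimal sketch of how they could be computed, not the authors' implementation. The function names and toy data are hypothetical, and the pooled-standard-deviation form of the standardized mean difference is an assumption, since the captions do not specify which variant is used.

```python
import math
import statistics


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_knn_distance(point, reference, k=100):
    """Average L2 distance from `point` to its k nearest neighbors in `reference`.

    If `reference` holds fewer than k points, all of them are used.
    """
    dists = sorted(l2_distance(point, r) for r in reference)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def standardized_mean_difference(scores_a, scores_b):
    """(mean_b - mean_a) divided by the pooled sample standard deviation.

    Assumption: a Cohen's-d style definition; the source does not state one.
    """
    n_a, n_b = len(scores_a), len(scores_b)
    var_a = statistics.variance(scores_a)
    var_b = statistics.variance(scores_b)
    pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(scores_b) - statistics.mean(scores_a)) / pooled


# Toy 2-D "embeddings" standing in for D_u and one surrogate point (hypothetical data).
d_u_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
surrogate_point = [3.0, 4.0]
avg_dist = mean_knn_distance(surrogate_point, d_u_embeddings, k=2)

# Toy normalized metric scores on D_u and on the surrogate set (hypothetical data).
scores_d_u = [0.1, 0.2, 0.3]
scores_surrogate = [0.5, 0.6, 0.7]
smd = standardized_mean_difference(scores_d_u, scores_surrogate)
```

Under this sign convention, a positive value means the metric scores the surrogate set higher than $D_u$, which, given the normalization described for Figure 2, would correspond to the surrogate set appearing less successfully unlearned.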