Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols

Yongwoo Kim, Sungmin Cha, Donghyun Kim

TL;DR

The paper identifies a gap in unlearning evaluation: logit-based metrics computed on small-scale datasets fail to detect whether a model's internal representations have actually forgotten the target data. It proposes a unified benchmark that combines CKA-based representation similarity with $k$-NN transferability, plus a Top Class-wise Forgetting protocol in which the forget classes are semantically similar to downstream task classes, all evaluated on large-scale data such as ImageNet-1K. Empirical results show that many state-of-the-art unlearning methods do not meaningfully alter internal representations even when their logit-based scores look favorable, and that representation-aware metrics such as AGR and H-LR expose these shortcomings. The work offers a standardized, scalable evaluation protocol and releases code and datasets to support unlearning research at real-world scales.
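
To make the $k$-NN transferability idea concrete, the following is a minimal sketch of a $k$-NN probe on frozen features. It is not the paper's implementation: the function name `knn_accuracy`, the use of cosine similarity, majority voting, and the default `k=5` are all assumptions chosen for illustration, and this summary does not state the settings the authors use.

```python
import numpy as np


def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=5):
    """k-NN classification accuracy on frozen features (illustrative sketch).

    Assumptions (not taken from the paper): cosine similarity, majority vote,
    integer labels in [0, n_classes), and k <= number of training samples.
    """
    eps = 1e-12  # guards against division by zero for all-zero feature rows
    train = train_feats / (np.linalg.norm(train_feats, axis=1, keepdims=True) + eps)
    test = test_feats / (np.linalg.norm(test_feats, axis=1, keepdims=True) + eps)

    # Cosine similarity between every test and every training sample.
    sims = test @ train.T  # shape: (n_test, n_train)

    # Indices of the k most similar training samples for each test sample.
    nn_idx = np.argpartition(-sims, kth=k - 1, axis=1)[:, :k]
    nn_labels = train_labels[nn_idx]  # shape: (n_test, k)

    # Majority vote over the neighbours' labels.
    n_classes = int(train_labels.max()) + 1
    votes = np.zeros((test.shape[0], n_classes), dtype=np.int64)
    rows = np.arange(test.shape[0])[:, None]
    np.add.at(votes, (rows, nn_labels), 1)
    preds = votes.argmax(axis=1)

    return float((preds == test_labels).mean())
```

Under these assumptions, features would be extracted once from the frozen backbone of each model (original, retrained, unlearned) on a downstream dataset, and the resulting accuracies compared. An unlearned model whose score tracks the retrained model more closely than the original would be the desired outcome.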

Abstract

Machine unlearning is a process to remove specific data points from a trained model while maintaining the performance on retain data, addressing privacy or legal requirements. Despite its importance, existing unlearning evaluations tend to focus on logit-based metrics (i.e., accuracy) under small-scale scenarios. We observe that this could lead to a false sense of security in unlearning approaches under real-world scenarios. In this paper, we conduct a new comprehensive evaluation that employs representation-based evaluations of the unlearned model under large-scale scenarios to verify whether the unlearning approaches genuinely eliminate the targeted forget data from the model's representation perspective. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model or merely modify the classifier (i.e., the last layer), thereby achieving superior logit-based evaluation metrics while maintaining significant representational similarity to the original model. Furthermore, we introduce a rigorous unlearning evaluation setup, in which the forgetting classes exhibit semantic similarity to downstream task classes, necessitating that feature representations diverge significantly from those of the original model, thus enabling a more rigorous evaluation from a representation perspective. We hope our benchmark serves as a standardized protocol for evaluating unlearning algorithms under realistic conditions.

Paper Structure

This paper contains 19 sections, 12 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: A comparison of (a) the traditional evaluation framework and (b) our proposed evaluation framework. Note that $\theta_u$ and $\theta_r$ refer to the unlearned model and the model trained on the retain set, respectively. Traditional unlearning evaluation methods primarily focus on analyzing the unlearned model's output logits to assess the effectiveness of unlearning under small-scale scenarios such as CIFAR-10. In contrast, our framework introduces additional evaluation factors by examining the unlearned model's feature representation similarity with $\theta_r$ in terms of transferability and representational similarity under the large-scale unlearning scenario, such as using ImageNet-1K.
  • Figure 2: Performance comparison between logit-based (left) and representation-based evaluation (right) reveals contrasting findings.
  • Figure 3: We visualize the feature representations from the original ($\theta_o$), retrained ($\theta_r$), and unlearned ($\theta_u$) models on a subset of ImageNet-1K using ResNet-50. Unlike $\theta_r$, which serves as the gold standard, the forget classes (highlighted in red) in the existing unlearning baselines (e.g., (d), (e), and (f)) are severely dispersed throughout the entire class distribution.
  • Figure 4: Two analyses using CKA similarity are presented: (a) CKA similarity analysis of various unlearning algorithms. The x-axis shows the similarity to $\theta_o$ and the y-axis shows the similarity to $\theta_r$. In an ideal scenario, algorithms should be positioned near $\theta_r$. However, most algorithms exhibit a high similarity to $\theta_o$, suggesting that the transformation of representations during unlearning is suboptimal. (b) The x-axis shows the feature similarity between the fully unlearned model ($\theta_u$) and the model in which only the final layer is unlearned ($\theta_u^{last}$). The y-axis shows the gap in AGL scores between $\theta_u$ and $\theta_u^{last}$. The results indicate that existing algorithms primarily modify only the last layer while maintaining the original representation space.
  • Figure 5: (a) AGL results for unlearning experiments on pretrained ResNet-50 models using CIFAR-10 and ImageNet-1K. We conduct unlearning on a pretrained model for one randomly selected class from CIFAR-10 and 100 randomly selected classes from ImageNet-1K. (b) AGL and AGR results for unlearning experiments on ImageNet-1K pretrained ResNet-50 models. We conduct unlearning on a pretrained model for 100 randomly selected classes from ImageNet-1K and report the corresponding AGL and AGR values. Notably, following SalUn's protocol, we excluded test set accuracy when calculating AGL in both (a) and (b).
  • ...and 5 more figures
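
The CKA comparisons described in the Figure 4 caption can be illustrated with a short sketch. The code below computes the standard linear variant of CKA between two feature matrices extracted from the same inputs. Whether the paper uses the linear variant, a kernel variant, or a minibatch estimator is not stated in this summary, so treat the choice as an assumption; the function name `linear_cka` and the variable names in the usage lines are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between feature matrices of shape (n_samples, dim).

    Both matrices must come from the same n_samples inputs, in the same order;
    the feature dimensions may differ.
    """
    # Center every feature dimension across the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical usage with random stand-ins for real model features.
rng = np.random.default_rng(0)
feats_original = rng.normal(size=(512, 16))   # stand-in for features of theta_o
feats_unlearned = rng.normal(size=(512, 16))  # stand-in for features of theta_u
similarity_to_original = linear_cka(feats_unlearned, feats_original)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features. In the setting of Figure 4(a), one would compute this score once against features from $\theta_o$ and once against features from $\theta_r$; a score near 1 against $\theta_o$ corresponds to the situation the caption describes, in which the representation has barely changed.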