Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols
Yongwoo Kim, Sungmin Cha, Donghyun Kim
TL;DR
The paper identifies a gap in unlearning evaluation: logit-based metrics on small-scale datasets fail to detect whether representations have actually been forgotten. It proposes a unified benchmark combining CKA-based representation similarity and $k$-NN transferability, plus a Top Class-wise Forgetting protocol that stresses semantic similarity with downstream tasks, evaluated on large-scale data such as ImageNet-1K. Empirical results show that many state-of-the-art unlearning methods do not meaningfully alter internal representations, even when logit-based scores look favorable, and that representation-aware metrics such as AGR and H-LR reveal these shortcomings. The work offers a standardized, scalable evaluation protocol and releases code and datasets to advance robust unlearning research at real-world scales.
Abstract
Machine unlearning is the process of removing specific data points from a trained model while maintaining performance on the retain data, in order to address privacy or legal requirements. Despite its importance, existing unlearning evaluations tend to focus on logit-based metrics (e.g., accuracy) in small-scale scenarios. We observe that this can lead to a false sense of security about unlearning approaches in real-world scenarios. In this paper, we conduct a new, comprehensive evaluation that applies representation-based metrics to the unlearned model in large-scale scenarios, to verify whether unlearning approaches genuinely eliminate the targeted forget data from the model's representations. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model or merely modify the classifier (i.e., the last layer), thereby achieving superior logit-based evaluation metrics while maintaining significant representational similarity to the original model. Furthermore, we introduce a rigorous unlearning evaluation setup in which the forgetting classes are semantically similar to downstream task classes. Successful unlearning in this setup requires feature representations to diverge significantly from those of the original model, which makes the evaluation more demanding from a representation perspective. We hope our benchmark serves as a standardized protocol for evaluating unlearning algorithms under realistic conditions.
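
To make the notion of "representational similarity to the original model" concrete, the following is a minimal sketch of linear CKA computed between two feature matrices. It is an illustration under stated assumptions, not the authors' implementation: this section does not specify which CKA variant the benchmark uses, so the linear form is assumed here, and the function name `linear_cka`, the variable names, and the synthetic random features are all hypothetical. The synthetic data only illustrates the metric's behavior and is not a result from the paper.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Assumption: the linear variant of CKA; the paper's exact variant
    is not stated in this section.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for penultimate-layer features (hypothetical data).
rng = np.random.default_rng(0)
feats_original = rng.normal(size=(200, 16))

# An "unlearned" model that only changed its classifier head produces
# the same features as the original model on the same inputs.
feats_head_only = feats_original.copy()

# A model whose encoder changed is mimicked here by independent features.
feats_changed = rng.normal(size=(200, 16))

cka_head_only = linear_cka(feats_original, feats_head_only)
cka_changed = linear_cka(feats_original, feats_changed)
print(f"head-only edit: {cka_head_only:.3f}, changed encoder: {cka_changed:.3f}")
```

In this sketch, a method that edits only the last layer leaves the features untouched, so its CKA against the original model stays at 1 regardless of how much its logits, and therefore its accuracy on the forget set, have changed. That is the gap between logit-based and representation-based evaluation that the abstract describes.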
