Unlearning's Blind Spots: Over-Unlearning and Prototypical Relearning Attack

SeungBum Ha, Saerom Park, Sung Whan Yoon

TL;DR

This work reveals two overlooked risks in class-level machine unlearning: over-unlearning near the forget-class boundary and prototypical relearning after unlearning. It introduces OU@ε, a metric for collateral damage near the forget boundary that requires no retained data, and the Prototypical Relearning Attack (PRA), which exploits forget-class structure that survives unlearning. The authors propose Spotter, a plug-and-play objective that combines masked distillation in regions adjacent to the forget classes with an intra-class dispersion loss; it reduces OU@ε and neutralizes PRA without using retained data. Across CIFAR, TinyImageNet, and CASIA-WebFace, Spotter achieves near-complete forgetting while preserving retained-class utility, and it applies to diverse architectures, including transformers, which supports its practicality for privacy-preserving unlearning.
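
The TL;DR describes OU@ε only at a high level. As a minimal illustration of what a boundary-proximal metric that needs no retained data can look like, the sketch below assumes (hypothetically) that the perturbed set is drawn uniformly from an L∞ ball of radius ε around forget-class inputs, and that OU@ε is the fraction of perturbed points that the original model assigns to a retained class and whose prediction the unlearned model changes. These definitions, the helper names `perturb_linf` and `ou_at_eps`, and the toy 1-D models are assumptions for illustration, not the paper's formulation.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
Model = Callable[[Vector], int]  # toy stand-in: maps an input vector to a class id


def perturb_linf(x: Sequence[float], eps: float, rng: random.Random) -> Vector:
    """Sample a point in the L-infinity ball of radius eps around x (uniform noise)."""
    return [xi + rng.uniform(-eps, eps) for xi in x]


def ou_at_eps(original: Model, unlearned: Model, forget_inputs: Sequence[Vector],
              forget_class: int, eps: float, n_samples: int = 20, seed: int = 0) -> float:
    """Hypothetical OU@eps: among eps-perturbed forget inputs that the ORIGINAL model
    assigns to a retained class, the fraction whose prediction the UNLEARNED model
    changes. Only forget-set inputs are used, so no retained data is required."""
    rng = random.Random(seed)
    near_boundary = 0  # perturbed points the original model puts in a retained class
    damaged = 0        # of those, points whose label changed after unlearning
    for x in forget_inputs:
        for _ in range(n_samples):
            x_pert = perturb_linf(x, eps, rng)
            y_orig = original(x_pert)
            if y_orig == forget_class:
                continue  # still inside the forget region, so not collateral damage
            near_boundary += 1
            if unlearned(x_pert) != y_orig:
                damaged += 1
    return damaged / near_boundary if near_boundary else 0.0


# Toy 1-D example: the original model separates forget class 0 from class 1 at 0.5.
def original(x: Vector) -> int:
    return 0 if x[0] < 0.5 else 1


def careful(x: Vector) -> int:
    return 1  # forgets class 0 but leaves the class-1 side of the boundary intact


def overreaching(x: Vector) -> int:
    return 2 if x[0] < 0.55 else 1  # also pushes nearby class-1 points to another class


forget_set = [[0.49], [0.51]]
print(ou_at_eps(original, careful, forget_set, forget_class=0, eps=0.03))       # 0.0
print(ou_at_eps(original, overreaching, forget_set, forget_class=0, eps=0.03))  # 1.0
```

In the toy run, `careful` scores 0.0 because retained-class points near the boundary keep their labels, while `overreaching` relabels all of them and scores 1.0. The paper's perturbed set may be generated differently (for example, adversarially); only the idea of an ε-perturbed set around the forget data comes from the source.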

Abstract

Machine unlearning (MU) aims to expunge a designated forget set from a trained model without costly retraining, yet existing techniques overlook two critical blind spots: "over-unlearning", which degrades retained data near the forget set, and post-hoc "relearning" attacks, which aim to resurrect the forgotten knowledge. Focusing on class-level unlearning, we first derive an over-unlearning metric, OU@ε, which quantifies collateral damage in regions proximal to the forget set, where over-unlearning mainly appears. Next, we expose a previously unrecognized relearning threat to MU, the Prototypical Relearning Attack, which exploits the per-class prototype of the forget class computed from just a few samples and easily restores pre-unlearning performance. To counter both blind spots in class-level unlearning, we introduce Spotter, a plug-and-play objective that combines (i) a masked knowledge-distillation penalty on the region near the forget classes to suppress OU@ε, and (ii) an intra-class dispersion loss that scatters forget-class embeddings, neutralizing Prototypical Relearning Attacks. Spotter achieves state-of-the-art results on the CIFAR, TinyImageNet, and CASIA-WebFace datasets, offering a practical remedy to unlearning's blind spots.
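
The two Spotter terms in the abstract can be made concrete with a small sketch under loudly stated assumptions: (i) the masked knowledge-distillation penalty is taken to be KL(teacher || student) between the original and unlearned models' logits on points near the forget region, with the forget-class logit removed from both before normalizing; (ii) the intra-class dispersion loss is taken to be the mean pairwise cosine similarity of forget-class embeddings, which decreases as they spread apart. Because Spotter is described as plug-and-play, both terms are added to a host unlearning loss, here the hypothetical `base_unlearn_loss`. All function names, the masking scheme, the cosine form, and the weights `lam_kd` and `lam_disp` are illustrative assumptions, not the authors' objective.

```python
import math
from typing import List, Sequence

Vector = List[float]


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # shift by the max for numerical stability
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]


def masked_kd(teacher_logits: Sequence[float], student_logits: Sequence[float],
              forget_class: int) -> float:
    """Hypothetical masked KD: KL(teacher || student) over retained classes only.
    The forget-class logit is dropped from both models before normalizing, so the
    student keeps the teacher's retained-class predictions without being pulled
    back toward the forget class."""
    keep = [i for i in range(len(teacher_logits)) if i != forget_class]
    log_p = log_softmax([teacher_logits[i] for i in keep])
    log_q = log_softmax([student_logits[i] for i in keep])
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def dispersion_loss(embeddings: Sequence[Vector]) -> float:
    """Hypothetical dispersion term: mean pairwise cosine similarity of forget-class
    embeddings. Minimizing it spreads the embeddings apart."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    sims = [cosine(embeddings[i], embeddings[j]) for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)


def spotter_like_objective(base_unlearn_loss: float, teacher_batch: Sequence[Vector],
                           student_batch: Sequence[Vector],
                           forget_embeddings: Sequence[Vector], forget_class: int,
                           lam_kd: float = 1.0, lam_disp: float = 1.0) -> float:
    """Plug-in form: host unlearning loss + masked KD (averaged over a batch of logits
    on points near the forget region) + dispersion of forget-class embeddings."""
    kd = sum(masked_kd(t, s, forget_class) for t, s in zip(teacher_batch, student_batch))
    kd /= max(len(teacher_batch), 1)
    return base_unlearn_loss + lam_kd * kd + lam_disp * dispersion_loss(forget_embeddings)


# The student's forget-class logit (100.0) is masked out, so KD is 0; the two
# forget-class embeddings point in opposite directions, so dispersion is -1.
print(spotter_like_objective(0.5, [[1.0, 2.0, 3.0]], [[1.0, 2.0, 100.0]],
                             [[1.0, 0.0], [-1.0, 0.0]], forget_class=2))  # -0.5
```

This plain-Python version only evaluates loss values; an actual unlearning run would express the same terms in an autodiff framework so they can be minimized with respect to the unlearned model's parameters.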

Paper Structure

This paper contains 25 sections, 11 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: UMAP plots of CIFAR-10 representations computed with unlearned feature extractors, where the forget class is highlighted in red and the remaining classes are shown in gray.
  • Figure 2: (a) Acc$_{f}$ comparisons of the Original, Unlearned, and Relearned models: 'Unlearn' denotes unlearned models without relearning, and 'RelearnN' denotes models relearned on $N$ samples for a single epoch. (b) Acc$_{f}$ comparisons of the Original, Unlearned, and Prototypical Relearned models; only $N$ samples are used to compute the class prototype ('ProtoN'). We set the hyperparameter $\alpha$ so that the drop in Acc$_{r}$ does not exceed 1%.
  • Figure 3: UMAP plots of CIFAR-10 representations computed with the Spotter-unlearned feature extractor, where the forget class is highlighted in red and the remaining classes are shown in gray.
  • Figure 4: Grad-CAM analysis of face recognition under PRA. Class activation maps for forget-class face images from the Original, Retrain, DELETE, and Spotter models, on clean inputs and under PRA.
  • Figure 5: (a) Ablation on the $\varepsilon$ used in $\mathcal{L}_{\text{o}}$: we unlearn models with varying $\varepsilon$ in $\mathcal{L}_{\text{o}}$ and measure over-unlearning on the perturbed set generated with $\varepsilon=0.03$. (b) Ablation on the $\varepsilon$ used in $\operatorname{OU}@\varepsilon$: we measure $\operatorname{OU}@\varepsilon$ while varying the $\varepsilon$ used to generate the perturbed set. Dotted lines represent the case where Spotter is not applied.
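
Figure 2(b) states that the prototypical attack uses only N forget-class samples to compute a class prototype. The sketch below illustrates that idea under stated assumptions: a cosine-similarity classifier head, a feature extractor whose forget-class embeddings still cluster after unlearning (the "residual forget-class structure" mentioned in the TL;DR), and an attacker who averages N embeddings, L2-normalizes the mean, and writes it into the forget-class row of the head without any gradient step. The head, the `identity_features` extractor, and all helper names are hypothetical toy stand-ins, not the authors' attack code.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Features = Callable[[Vector], Vector]  # toy stand-in for the unlearned feature extractor


def l2_normalize(v: Sequence[float]) -> Vector:
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def class_prototype(features: Features, samples: Sequence[Vector]) -> Vector:
    """Mean embedding of the N attacker-held forget samples, L2-normalized."""
    embs = [features(x) for x in samples]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    return l2_normalize(mean)


def prototype_relearn(head: Sequence[Vector], features: Features,
                      forget_samples: Sequence[Vector], forget_class: int) -> List[Vector]:
    """Copy the cosine head and overwrite the forget-class row with the prototype.
    No training step is taken; the attack relies only on structure left in the features."""
    new_head = [list(w) for w in head]
    new_head[forget_class] = class_prototype(features, forget_samples)
    return new_head


def predict(head: Sequence[Vector], features: Features, x: Vector) -> int:
    """Cosine-similarity classifier: pick the head row most aligned with the embedding."""
    z = l2_normalize(features(x))
    scores = [sum(a * b for a, b in zip(l2_normalize(w), z)) for w in head]
    return max(range(len(scores)), key=scores.__getitem__)


def identity_features(x: Vector) -> Vector:
    return list(x)  # identity extractor for the toy example


# Toy head after unlearning: the class-2 row no longer points at the class-2 cluster,
# but class-2 embeddings still cluster around the (-1, -1) direction.
head = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.9]]
query = [-1.0, -0.95]
few_forget_samples = [[-1.0, -0.9], [-0.9, -1.0], [-1.0, -1.0]]  # N = 3

print(predict(head, identity_features, query))      # 1: class 2 is not recovered
attacked = prototype_relearn(head, identity_features, few_forget_samples, forget_class=2)
print(predict(attacked, identity_features, query))  # 2: the 3-sample prototype restores it
```

Under these assumptions, scattering forget-class embeddings, as the abstract's intra-class dispersion loss aims to do, would tend to leave the averaged prototype with little directional signal, which is the intuition for countering the attack in embedding space rather than only at the classifier.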