POUR: A Provably Optimal Method for Unlearning Representations via Neural Collapse

Anjie Le, Can Peng, Yuyuan Liu, J. Alison Noble

TL;DR

This work reframes machine unlearning as a representation-level problem guided by Neural Collapse geometry. It introduces a provably optimal projection-based forgetting operator (POUR) with a closed-form variant (POUR-P) and a distillation-based variant (POUR-D); both preserve the simplex ETF structure among retained classes while erasing the forgotten one. The authors establish theoretical guarantees: (i) ETF geometry implies Bayes-optimality for retained classes, and (ii) orthogonal projection preserves this geometry under forgetting, enabling complete forgetting of the target class in the NC limit. Empirically, POUR achieves superior forgetting and retention across CIFAR-10/100 and PathMNIST, outperforming baselines on both classification- and representation-level metrics, and demonstrating robustness under domain shift. The work provides a principled, geometry-driven approach to representation-level unlearning with practical implications for privacy, safety, and deployment of pretrained models.
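The projection-based operator can be illustrated in the idealized Neural Collapse limit. This is a minimal NumPy sketch built on several assumptions that are ours, not the paper's. Each feature equals its class mean, the class means form a unit-norm simplex ETF, the classifier rows coincide with those means (self-duality), and the forgotten direction is the forgotten class mean. The names `simplex_etf`, `forget_projector`, and `alignment_loss` are hypothetical. The sketch applies the projector to the original features directly, in the spirit of the closed-form variant. It then evaluates an L2 alignment objective of the kind a distillation variant would minimize. It does not train anything.

```python
import numpy as np

def simplex_etf(C):
    # Hypothetical helper: rows are C unit-norm vectors with pairwise inner product -1/(C-1).
    return np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

def forget_projector(mu_forget):
    # Assumed operator: P_A = I - w w^T, with w the unit-norm forgotten-class direction.
    w = mu_forget / np.linalg.norm(mu_forget)
    return np.eye(w.shape[0]) - np.outer(w, w)

def alignment_loss(student_feats, teacher_feats, P):
    # Mean over samples of ||f_student(x) - P f_teacher(x)||_2^2 (P is symmetric).
    target = teacher_feats @ P.T
    return float(np.mean(np.sum((student_feats - target) ** 2, axis=1)))

# Idealized NC-limit setup (assumption): every feature equals its class mean,
# and the classifier weights equal the class means.
C, A = 5, 0                       # 5 classes; class 0 is forgotten
M = simplex_etf(C)                # class means / classifier rows, shape (C, C)
labels = np.repeat(np.arange(C), 3)
teacher = M[labels]               # original-model features, shape (15, C)

P = forget_projector(M[A])
student = teacher @ P.T           # closed-form: project the original features
logits = student @ M.T            # linear classifier without bias

print(alignment_loss(student, teacher, P))   # 0.0, since student == projected teacher
```

In this setup the forgotten class's features are mapped to the origin (up to rounding), so all of its logits are zero and the softmax is uniform. Each retained feature keeps its largest logit on its own class. In a real distillation run, the student features would come from a trainable extractor and the loss would be minimized with an autodiff framework, which is outside this sketch.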

Abstract

In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting. In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower-dimensional space, yielding a provably optimal forgetting operator. We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. On this basis, we propose POUR (Provably Optimal Unlearning of Representations), a geometric projection method with a closed-form variant (POUR-P) and a feature-level unlearning variant under a distillation scheme (POUR-D). Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics.
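The claim that projecting out one vertex of a simplex ETF leaves an ETF in one fewer dimension can be checked numerically. This NumPy sketch uses the standard construction $\sqrt{C/(C-1)}\,(I - \tfrac{1}{C}\mathbf{1}\mathbf{1}^\top)$, which is our choice and not taken from the paper. The helper names are hypothetical. It removes vertex $u$ with the projector $I - ww^\top$ and verifies the three properties of the retained vertices. They have equal norms. Their pairwise cosines all equal $-1/(C-2)$, the ETF value for $C-1$ classes. Finally, they sum to zero and span a $(C-2)$-dimensional subspace.

```python
import numpy as np

def simplex_etf(C: int) -> np.ndarray:
    """Rows are C unit-norm simplex ETF vertices in R^C (spanning a (C-1)-dim subspace)."""
    return np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

def project_out(V: np.ndarray, u: int) -> np.ndarray:
    """Project every retained vertex onto the orthogonal complement of vertex u."""
    w = V[u] / np.linalg.norm(V[u])
    P = np.eye(V.shape[1]) - np.outer(w, w)
    return np.delete(V, u, axis=0) @ P      # P is symmetric, so each row v_j maps to P v_j

def pairwise_cosines(R: np.ndarray) -> np.ndarray:
    """Off-diagonal cosine similarities between the rows of R."""
    Rn = R / np.linalg.norm(R, axis=1, keepdims=True)
    G = Rn @ Rn.T
    return G[~np.eye(len(R), dtype=bool)]

C, u = 6, 2
R = project_out(simplex_etf(C), u)
print(np.allclose(pairwise_cosines(R), -1 / (C - 2)))                   # True
print(np.allclose(np.linalg.norm(R, axis=1), np.linalg.norm(R[0])))     # True (equal norms)
print(np.allclose(R.sum(axis=0), 0))                                    # True (centered)
```

The centering follows directly: the retained vertices sum to $-v_u$, and the projector sends $v_u$ to zero.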

Paper Structure

This paper contains 36 sections, 14 theorems, 69 equations, 6 figures, 3 tables.

Key Result

Proposition 2.2

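Let $P_z^{(f)}$ and $P_z^{(r)}$ denote the feature distributions induced by the unlearned and retrained models, respectively. For a forget class $u \in \mathcal{Y}$ and an Integral Probability Metric (IPM) $\mathcal{K}$ defined on the feature space, the law of total probability lets us write each feature distribution as a mixture over the events $\hat{y}=u$ and $\hat{y}\neq u$, namely $P_z^{(f)} = \alpha\, P_z^{(f)}(\cdot \mid \hat{y}=u) + (1-\alpha)\, P_z^{(f)}(\cdot \mid \hat{y}\neq u)$ and, analogously, $P_z^{(r)} = \beta\, P_z^{(r)}(\cdot \mid \hat{y}=u) + (1-\beta)\, P_z^{(r)}(\cdot \mid \hat{y}\neq u)$, where $\alpha := P_z^{(f)}(\hat{y}=u)$ and $\beta := P_z^{(r)}(\hat{y}=u)$ are the predicted probabilities …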
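The proposition keeps $\mathcal{K}$ generic. One standard IPM is the kernel Maximum Mean Discrepancy (MMD), the IPM over the unit ball of an RKHS. The sketch below is NumPy-based and assumes nothing about which IPM the paper uses. It computes an empirical $\alpha$ as the fraction of samples predicted as $u$, and checks the law of total probability on the feature mean. It also estimates squared MMD with an RBF kernel. The features, labels, kernel width, and helper names are all hypothetical.

```python
import numpy as np

def predicted_mass(preds, u):
    """Empirical estimate of P(y_hat = u): the fraction of samples predicted as u."""
    return float(np.mean(np.asarray(preds) == u))

def rbf_kernel(A, B, gamma):
    # Pairwise squared Euclidean distances between rows of A and B, then exp(-gamma * d^2).
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=0.5):
    """Biased (V-statistic) estimate of squared MMD between samples X and Y."""
    return float(rbf_kernel(X, X, gamma).mean() + rbf_kernel(Y, Y, gamma).mean()
                 - 2.0 * rbf_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(0)
Z_f = rng.normal(size=(200, 4))           # hypothetical features from an unlearned model
preds_f = rng.integers(0, 3, size=200)    # hypothetical predicted labels
u = 0                                     # forget class
alpha = predicted_mass(preds_f, u)

# Law of total probability applied to the feature mean:
# E[z] = alpha * E[z | y_hat = u] + (1 - alpha) * E[z | y_hat != u]
mixture_mean = (alpha * Z_f[preds_f == u].mean(axis=0)
                + (1 - alpha) * Z_f[preds_f != u].mean(axis=0))
print(np.allclose(mixture_mean, Z_f.mean(axis=0)))   # True
```

With real models, `Z_f` and a matching retrained-model sample would replace the random arrays, and `mmd2` would compare them. Any other IPM with a sample estimator could be substituted.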

Figures (6)

  • Figure 1: Grad-CAM visualization on PathMNIST before and after unlearning. Each row shows a tissue class. After applying POUR on the adipose class, its Grad-CAM signal vanishes, while the retained classes (debris, lymphocytes, mucus) preserve clear and distinct attention patterns.
  • Figure 1: Grad-CAM visualization on PathMNIST before and after unlearning. Each row shows a tissue class. The Grad-CAM signal vanishes only after POUR unlearning.
  • Figure 2: C=4 simplex ETF. One vertex $v_1$ along $+z$; the other three lie at $z=-1/3$ with equal $120^\circ$ separation in $xy$. Orthogonal projection onto $v_1^\perp$ ($z=0$) yields an equilateral triangle formed by $u_2,u_3,u_4$.
  • Figure 3: Overview of the POUR framework. During training, the unlearning module applies an orthogonal projection operator $P_A$ to the feature space of the original model to remove the contribution of the forgotten class $A$. The unlearned feature extractor $\theta'$ is optimized via an $L_2$ loss to align its projected features with those of the original extractor $\theta$ using the unlearning data. This alignment preserves the Neural Collapse geometry among retained classes ($B$, $C$, $D$) while collapsing features of the forgotten class to the origin, leading to uniform predictions. At inference, the unlearned model is Bayes-optimal on retained classes, as proved in the paper's optimality theorem.
  • Figure 4: t-SNE visualization of representation spaces after unlearning on CIFAR-10 and CIFAR-100. Each color denotes a retained class, with dark red points representing the forgotten class. The Gold panel shows the representation of the retrained model, serving as the ideal reference for successful unlearning. The structure of the representations after POUR unlearning most closely resembles that of the retrained gold model.
  • ...and 1 more figure
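The $C=4$ example in Figure 2 can be checked numerically. This NumPy sketch is our own, not the paper's code. It places the four vertices exactly as the caption describes, with the $xy$-radius $r=\sqrt{8}/3$ chosen so each vertex has unit norm. It confirms they form a simplex ETF with pairwise inner product $-1/3$. It then projects $v_2, v_3, v_4$ onto $v_1^\perp$ (the plane $z=0$).

```python
import numpy as np

# Figure 2 layout: v1 on +z, three vertices at z = -1/3 spaced 120 degrees apart in xy.
r = np.sqrt(8.0) / 3.0                     # r^2 + (1/3)^2 = 1, so every vertex has unit norm
angles = np.deg2rad([0.0, 120.0, 240.0])
v1 = np.array([0.0, 0.0, 1.0])
others = np.stack([r * np.cos(angles), r * np.sin(angles), np.full(3, -1.0 / 3.0)], axis=1)
V = np.vstack([v1, others])                # shape (4, 3)

# Simplex ETF check: unit norms and all pairwise inner products equal -1/(C-1) = -1/3.
G = V @ V.T
off_diag = G[~np.eye(4, dtype=bool)]

# Orthogonal projection onto v1^perp, i.e. the plane z = 0.
P = np.eye(3) - np.outer(v1, v1)
U = others @ P                             # rows are u2, u3, u4
sides = [np.linalg.norm(U[i] - U[j]) for i, j in [(0, 1), (1, 2), (0, 2)]]
print(np.allclose(off_diag, -1.0 / 3.0), np.allclose(sides, sides[0]))   # True True
```

After renormalization, the projected vectors have pairwise cosine $-1/2$. This is the simplex ETF value for three classes, so the equilateral triangle in the figure is itself a $C=3$ ETF.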

Theorems & Definitions (26)

  • Definition 2.1: Representation-Level Weak Unlearning
  • Proposition 2.2: Decomposition of $\mathcal{K}$ Bound
  • Proposition 3.1: ETF $\Rightarrow$ Bayes optimality
  • Proposition 3.2: Projection of a simplex ETF remains a simplex ETF
  • Proposition 4.1: L2 convergence implies CKA convergence
  • Theorem 4.2: Optimality of POUR-P
  • Proposition 1.1: Decomposition of $\mathcal{K}$ Bound
  • Proof
  • Proposition 1.2: CKA is invariant to isotropic scaling
  • Proof
  • ...and 16 more
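Propositions 4.1 and 1.2 concern CKA. The listing does not show which CKA variant the paper uses, so this NumPy sketch assumes the standard linear CKA, and `linear_cka` is a hypothetical helper. It illustrates the two listed properties. First, scaling one representation isotropically leaves CKA at 1. Second, a small L2 perturbation keeps CKA close to 1, while a large one lowers it. The random data are illustrative only.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X (n x d1) and Y (n x d2) of the same n inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
N = rng.normal(size=(100, 8))

scaled = linear_cka(X, 3.7 * X)        # isotropic scaling: 1 up to rounding
near = linear_cka(X, X + 0.01 * N)     # small L2 perturbation: close to 1
far = linear_cka(X, X + 1.0 * N)       # large perturbation: noticeably lower
```

Scale invariance holds because multiplying $Y$ by $c$ multiplies both the numerator and the denominator by $c^2$.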