Revisiting Machine Unlearning with Dimensional Alignment

Seonguk Seo; Dongwan Kim; Bohyung Han

TL;DR

This work addresses machine unlearning under data privacy constraints by shifting the focus from label-level mislearning to changes in the latent feature space. It introduces Dimensional Alignment (DA) as both a metric and a regularizer that aligns forget-set features with the retain-set feature manifold, and couples it with a self-distillation loss in an alternating forget/recover training scheme (MUDA). The authors demonstrate that MUDA effectively erases information from the forget data while preserving performance on the retain data, and propose a set of feature-space evaluation tools (DA, LP, F1, NMI) that reflect the goals of unlearning better than traditional output-based metrics. The approach yields stable training, a robust defense against backdoor attacks, and results that closely match those of retraining, with practical implications for privacy-compliant deployment of deep models.
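NMI is one of the feature-space evaluation tools listed above. The summary does not say how cluster assignments are obtained, so the sketch below assumes they come from some clustering of the features (for example k-means) and shows only the NMI computation between those assignments and the class labels. The function name `nmi` is hypothetical, and the arithmetic-mean normalization is an assumption (it matches scikit-learn's default `average_method`), not necessarily the paper's choice.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings (hypothetical helper).

    Normalizes MI by the arithmetic mean of the two entropies (an assumption).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information: sum over p(a, b) * log(p(a, b) / (p(a) * p(b))).
    mi = sum(
        (c / n) * math.log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    # Shannon entropies of each labeling.
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    # Two single-cluster labelings are treated as identical.
    if h_true + h_pred == 0:
        return 1.0
    return 2.0 * mi / (h_true + h_pred)


if __name__ == "__main__":
    # Cluster ids are a relabeling of the classes, so NMI is 1.0.
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
    # Clusters carry no class information, so NMI is 0.0.
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because NMI is invariant to permuting cluster ids, it can compare unsupervised cluster structure with semantic labels without matching clusters to classes.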

Abstract

Machine unlearning, an emerging research topic focusing on compliance with data privacy regulations, enables trained models to remove the information learned from specific data. While many existing methods indirectly address this issue by intentionally injecting incorrect supervision, they can drastically and unpredictably alter the decision boundaries and feature spaces, leading to training instability and undesired side effects. To approach this task more fundamentally, we first analyze the changes in latent feature spaces between original and retrained models, and observe that the feature representations of samples not involved in training are closely aligned with the feature manifolds of samples seen during training. Based on these findings, we introduce a novel evaluation metric for machine unlearning, called dimensional alignment, which measures the alignment between the eigenspaces of the forget and retain set samples. We employ this metric as a regularizer loss to build a robust and stable unlearning framework, which is further enhanced by integrating a self-distillation loss and an alternating training scheme. Our framework effectively eliminates information from the forget set and preserves knowledge from the retain set. Lastly, we identify critical flaws in established evaluation metrics for machine unlearning, and introduce new evaluation tools that more accurately reflect the fundamental goals of machine unlearning.
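The abstract describes dimensional alignment only as a measure of alignment between the eigenspaces of the forget and retain features. The sketch below is one plausible instance of that idea, not the paper's exact definition: it takes the top-k eigenvectors of the (uncentered) retain feature matrix and reports the fraction of forget-feature energy that falls inside their span. The function name `dimensional_alignment`, the parameter `k`, the lack of centering, and the energy-ratio form are all assumptions.

```python
import numpy as np


def dimensional_alignment(feat_forget: np.ndarray, feat_retain: np.ndarray, k: int) -> float:
    """Hypothetical DA-style score in [0, 1].

    feat_forget: (n_f, d) forget-set features; feat_retain: (n_r, d) retain-set features.
    Returns the fraction of forget-feature energy lying in the span of the top-k
    eigenvectors of feat_retain.T @ feat_retain (uncentered, an assumption).
    """
    # Rows of vt are right singular vectors of the retain features, i.e.
    # eigenvectors of the retain second-moment matrix, in decreasing order.
    _, _, vt = np.linalg.svd(feat_retain, full_matrices=False)
    basis = vt[:k].T                      # (d, k) orthonormal basis of the retain subspace
    proj = feat_forget @ basis            # coordinates of forget features in that subspace
    total = float(np.sum(feat_forget ** 2))
    if total == 0.0:
        raise ValueError("forget features must not all be zero")
    return float(np.sum(proj ** 2)) / total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy retain features occupying only the first two of five dimensions.
    retain = np.zeros((100, 5))
    retain[:, :2] = rng.normal(size=(100, 2))
    # Forget features inside the retain subspace: score close to 1.0.
    aligned = np.zeros((20, 5))
    aligned[:, :2] = rng.normal(size=(20, 2))
    # Forget features orthogonal to it: score close to 0.0.
    orth = np.zeros((20, 5))
    orth[:, 4] = rng.normal(size=20)
    print(dimensional_alignment(aligned, retain, k=2))
    print(dimensional_alignment(orth, retain, k=2))
```

Under this reading, a higher score after unlearning would mean the forget features have been pulled onto the retain feature manifold, which the abstract reports as the behavior of samples never seen in training.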

Paper Structure (47 sections, 3 equations, 5 figures, 10 tables)

Introduction
Preliminaries
Machine unlearning
Setting
Machine Unlearning with Dimensional Alignment (MUDA)
Unlearning as a reverse process of incremental learning
Dimensional alignment
Self-distillation loss for stable projection onto retain feature manifold
Overall framework
Verification of Machine Unlearning
Limitations of existing metrics
Evaluation metrics with semantic information
Linear Probing
F1 and NMI
Experiment
...and 32 more sections

Figures (5)

Figure 1: UMAP visualization of the CIFAR-10 training set under the incremental learning scenario, where the old and new models are trained with $\mathcal{D}_r$ and $\mathcal{D}_r \cup \mathcal{D}_f$, respectively. Black markers indicate the feature representations of $\mathcal{D}_f$.
Figure 2: Conceptual visualization of dimensional alignment.
Figure 3: Visualization of training stability. Solid and dashed lines denote LP($\mathcal{D}_r$) and LP($\mathcal{D}_f$), respectively. Compared to NegGrad, which requires well-timed early stopping, our framework converges to a stable point.
Figure 4: UMAP visualization of the CIFAR-10 training set under a backdoor attack scenario, (a) before unlearning and (b) after unlearning with our framework. Data points with black edges indicate the forget samples, which are poisoned with a backdoor trigger.
Figure 7: Unlearning results for defending against backdoor attacks, each averaged over five different configurations. The smallest absolute difference from the retrained model is highlighted.