Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers

Sungmin Cha; Sungjun Cho; Dasol Hwang; Honglak Lee; Taesup Moon; Moontae Lee

TL;DR

The paper addresses the need, driven by privacy and robustness concerns, to delete information learned by pre-trained classifiers without full retraining. It proposes an instance-wise unlearning framework that either forces misclassification of the forgetting data or relabels it, and adds two regularizers to limit forgetting on the remaining data: adversarial-example regularization, which preserves the decision boundary at the representation level, and weight-importance regularization (MAS), which confines changes to the weights that matter for the forgetting data. The paper defines objectives and losses for both misclassification and relabeling, reports results on CIFAR-10/100 and ImageNet-1K across several architectures, including continual unlearning scenarios, and shows that performance on the remaining data is preserved while the targeted instances are unlearned. Because the method needs only the pre-trained model and the instances to forget, it supports data deletion requests when the original training set is unavailable.

Abstract

Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand for deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks to adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining the predictive performance across remaining data. To this end, we consider instance-wise unlearning, whose goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods only require the pre-trained model and data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of remaining data while unlearning given instances in both single-task and continual unlearning scenarios.
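The misclassification objective and the two regularizers described in the abstract can be illustrated on a toy linear softmax classifier. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the single signed-gradient step used to build adversarial examples, the choice to label those examples with the model's own predictions, the `1 - importance` penalty, and the hyperparameters `eps`, `lam`, `mu`, and `lr` are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_grad_w(W, X, y):
    """Gradient of the mean cross-entropy w.r.t. W, with logits = X @ W.T."""
    p = softmax(X @ W.T)
    p[np.arange(len(y)), y] -= 1.0
    return p.T @ X / len(y)

def mas_importance(W, X):
    """MAS-style importance: mean |d ||f(x)||^2 / dW|, rescaled to [0, 1]."""
    F = X @ W.T                                   # outputs, shape (N, C)
    grads = 2.0 * F[:, :, None] * X[:, None, :]   # per-sample gradient 2 f x^T
    omega = np.abs(grads).mean(axis=0)
    return omega / omega.max()

def fgsm_examples(W, X, y, eps):
    """One signed-gradient step on the input (assumed stand-in attack).
    Labels are the model's predictions on the perturbed inputs."""
    p = softmax(X @ W.T)
    p[np.arange(len(y)), y] -= 1.0
    X_adv = X + eps * np.sign(p @ W)              # d CE / d x = (p - onehot) @ W
    return X_adv, (X_adv @ W.T).argmax(axis=1)

def unlearn(W0, X_f, y_f, eps=0.5, lam=1.0, mu=1.0, lr=0.1, max_steps=500):
    """Update W until every forget instance is misclassified (or steps run out)."""
    W = W0.copy()
    penalty = 1.0 - mas_importance(W0, X_f)       # near 0 for weights that matter to D_f
    X_adv, y_adv = fgsm_examples(W0, X_f, y_f, eps)
    for _ in range(max_steps):
        if np.all((X_f @ W.T).argmax(axis=1) != y_f):
            break                                 # all forget instances misclassified
        grad = (-ce_grad_w(W, X_f, y_f)             # ascend the loss on D_f
                + mu * ce_grad_w(W, X_adv, y_adv)   # keep predictions on adversarial examples
                + lam * penalty * (W - W0))         # hold weights unimportant to D_f in place
        W -= lr * grad
    return W
```

Calling `unlearn(np.eye(2), np.array([[1.0, 0.5]]), np.array([0]))` returns weights under which the single forget instance is no longer assigned class 0. The sketch touches only the pre-trained weights and the forget instances, which mirrors the data-access setting the abstract describes.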

Paper Structure (18 sections, 7 equations, 4 figures, 9 tables, 5 algorithms)

Introduction
Related Work
Method
Preliminaries and notations
Our proposed framework
Experiments
Experimental Setup
Main Results
Qualitative Analysis
Concluding Remarks
Acknowledgments
Pseudocode of our unlearning pipelines
Additional experimental results
Results on relabeling.
Results with other architectures.
...and 3 more sections

Figures (4)

Figure 1: Illustrations of our approaches that reduce forgetting on the remaining data. (Top) Augmenting adversarial examples from unlearning data provides support for preserving the overall decision boundary. (Bottom) Weight importance measures allow us to pinpoint weights we should change to induce misclassification while maintaining other weights to mitigate forgetting.
Figure 2: Confusion matrices from CIFAR-10 showing average pairwise frequencies of pre- (Y-axis) and post-unlearning (X-axis) prediction labels from $\mathcal{D}_f$. A blue color indicates higher frequency. Our unlearning framework does not produce any discernible correlation in misclassification.
Figure 3: t-SNE plots of CIFAR-10 datapoints in $\mathcal{D}_f$ (triangles) and $\mathcal{D}_r$ (dots) before and after unlearning. Colors indicate true labels for all plots. Regularization with adversarial examples and weight importance effectively preserves the decision boundary while migrating instances in $\mathcal{D}_f$ towards the class boundary to induce misclassification.
Figure 4: Layer-wise CKA correlations on $\mathcal{D}_f$ (top row) and $\mathcal{D}_r$ (bottom row) between representations before (X-axis) and after (Y-axis) unlearning. Brighter color indicates higher CKA correlation. NegGrad results in large forgetting of high-level features in both $\mathcal{D}_f$ and $\mathcal{D}_r$. Our approaches selectively forget high-level features only in $\mathcal{D}_f$.
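A layer-wise comparison like the one in Figure 4 can be sketched with the linear variant of centered kernel alignment (CKA). The CKA variant and the way activations are collected are not stated here, so the linear form and the helper names `linear_cka` and `layerwise_cka` below are assumptions made for illustration.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (n_samples, n_features) activation matrices."""
    X = X - X.mean(axis=0, keepdims=True)   # centre each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def layerwise_cka(acts_before, acts_after):
    """Grid whose entry [i, j] compares layer j before unlearning (X-axis)
    with layer i after unlearning (Y-axis)."""
    return np.array([[linear_cka(b, a) for b in acts_before] for a in acts_after])
```

Each list passed to `layerwise_cka` holds one activation matrix per layer, computed on the same inputs (for example, the instances in $\mathcal{D}_f$ or $\mathcal{D}_r$). Values close to 1 indicate that a layer's representation is largely unchanged up to rotation and isotropic scaling.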
