Discriminative Adversarial Unlearning

Rohan Sharma; Shijie Zhou; Kaiyi Ji; Changyou Chen

TL;DR

The paper introduces a discriminative adversarial unlearning framework that casts machine unlearning as a min-max game between a defender and an attacker, using signals from strong Membership Inference Attacks (MIA) to erase the influence of the samples to be forgotten while preserving general performance. It couples end-to-end differentiable training with a self-supervised regularizer (inspired by Barlow Twins) that aligns the feature spaces of the forget and validation sets, enabling effective forgetting without retraining from scratch. Empirical results on CIFAR-10/100 show near-optimal performance, measured against retraining from scratch, for both random and class-wise forgetting, with class-wise forgetting showing especially strong forgetting and MIA robustness; sparse networks further reduce runtime with minimal loss in performance. The approach avoids Hessian-based approximations and unrolling, offers significant speedups over retraining (more than 5x on some tasks), and can be adapted to stronger MIA attacks and improved optimization strategies. Overall, the method is a practical step toward privacy-preserving unlearning, with implications for compliance, trust, and sustainable ML deployment.
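To make the Barlow Twins reference concrete, the following is a minimal sketch of a Barlow Twins-style cross-correlation loss applied to two batches of feature vectors. It follows the original Barlow Twins formulation rather than this paper's exact regularizer: how forget and validation features are paired, the function names `standardize` and `barlow_twins_loss`, and the default `off_diag_weight` are all assumptions made for illustration.

```python
import math


def standardize(batch):
    """Scale each feature dimension to zero mean and unit variance across the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        # Population standard deviation; fall back to 1.0 for a constant column.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / std
    return out


def barlow_twins_loss(feats_a, feats_b, off_diag_weight=5e-3):
    """Barlow Twins-style loss between two equally sized feature batches.

    feats_a, feats_b: lists of N feature vectors of dimension D
    (hypothetically, features of forget-set and validation-set samples).
    off_diag_weight: assumed trade-off weight for the redundancy term.
    """
    za, zb = standardize(feats_a), standardize(feats_b)
    n, d = len(za), len(za[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the D x D cross-correlation matrix.
            c_ij = sum(za[k][i] * zb[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2  # pull diagonal entries toward 1
            else:
                loss += off_diag_weight * c_ij ** 2  # push off-diagonal entries toward 0
    return loss
```

The loss is zero when the cross-correlation matrix of the two batches equals the identity, that is, when matching dimensions are perfectly correlated and different dimensions are uncorrelated.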

Abstract

We introduce a novel machine unlearning framework founded upon the established principles of the min-max optimization paradigm. We capitalize on the capabilities of strong Membership Inference Attacks (MIA) to facilitate the unlearning of specific samples from a trained model. We consider the scenario of two networks, the attacker $\mathbf{A}$ and the trained defender $\mathbf{D}$, pitted against each other in an adversarial objective, wherein the attacker aims to tease out information about the data to be unlearned in order to infer membership, and the defender unlearns to defend the network against the attack while preserving its general performance. The algorithm can be trained end-to-end using backpropagation, following the well-known iterative min-max approach in updating the attacker and the defender. We additionally incorporate a self-supervised objective that addresses the feature space discrepancies between the forget set and the validation set, enhancing unlearning performance. Our proposed algorithm closely approximates the ideal benchmark of retraining from scratch for both random sample forgetting and class-wise forgetting schemes on standard machine-unlearning datasets. Specifically, on the class unlearning scheme, the method demonstrates near-optimal performance, and on the random sample forgetting scheme it comprehensively outperforms known methods across all metrics and multiple network pruning strategies.
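One generic way to write a game of this kind is sketched below. It is not the paper's exact objective, and all notation beyond $\mathbf{A}$ and $\mathbf{D}$ is assumed for illustration: $\theta_A$ and $\theta_D$ are the attacker and defender parameters, $\mathcal{D}_f$ and $\mathcal{D}_v$ are the forget and validation sets, $\mathcal{L}_{\mathrm{task}}$ is a loss that preserves general performance, $\mathcal{L}_{\mathrm{ss}}$ is the self-supervised regularizer, and $\lambda$ is its weight. The attacker's input is abstracted as the defender's output $\mathbf{D}_{\theta_D}(x)$.

$$
\min_{\theta_D}\ \max_{\theta_A}\quad
\mathbb{E}_{x \sim \mathcal{D}_f}\big[\log \mathbf{A}_{\theta_A}(\mathbf{D}_{\theta_D}(x))\big]
+ \mathbb{E}_{x \sim \mathcal{D}_v}\big[\log\big(1 - \mathbf{A}_{\theta_A}(\mathbf{D}_{\theta_D}(x))\big)\big]
+ \mathcal{L}_{\mathrm{task}}(\theta_D)
+ \lambda\, \mathcal{L}_{\mathrm{ss}}(\theta_D)
$$

In this reading, the attacker maximizes its log-likelihood of labeling forget samples as members and validation samples as non-members, while the defender minimizes that same quantity together with the task and regularization terms. The last two terms do not depend on $\theta_A$, so they affect only the defender's update in the alternating scheme.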

Paper Structure (20 sections, 4 equations, 2 figures, 2 tables)

Introduction
Related Work
Machine Unlearning
Exact unlearning
Approximate unlearning
Membership Inference Attacks (MIA)
The Proposed Method
The Attacker
The Defender
Adversarial Unlearning
Self-supervised Regularization
Experiments
Criteria
Configuration
Training Adjustments
...and 5 more sections

Figures (2)

Figure 1: Overview of the proposed framework. The figure depicts the interplay between the attacker and defender networks. The defender provides output and sensitivity information to the attacker, which in turn provides feedback to the defender for unlearning. The objective is supplemented with a feature-space self-supervised regularization between the forget and validation sets.
Figure 2: MIA robustness. We incorporate additional MIA methods and compare the robustness of our method against the baselines. The attacks are based on Prediction Confidence, Prediction Entropy, Modified Prediction Entropy and Prediction Probability, as described in the MIA metrics section of the Appendix.
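As a rough illustration of the scores named in the Figure 2 caption, the sketch below computes three of them from a softmax probability vector. The definitions follow common formulations in the MIA literature and may differ from those in the paper's appendix. In particular, taking prediction confidence to be the probability of the true label is an assumption, the function names are hypothetical, and the Prediction Probability attack is omitted because its definition is not given here.

```python
import math


def _safe_log(value, eps=1e-12):
    """Logarithm clamped away from zero to avoid math domain errors."""
    return math.log(max(value, eps))


def prediction_confidence(probs, label):
    """Probability assigned to the true label (assumed definition)."""
    return probs[label]


def prediction_entropy(probs):
    """Shannon entropy of the predicted distribution (label-independent)."""
    return -sum(p * _safe_log(p) for p in probs)


def modified_prediction_entropy(probs, label):
    """Label-aware entropy: near zero for a confident correct prediction,
    large for a confident wrong prediction."""
    p_y = probs[label]
    score = -(1.0 - p_y) * _safe_log(p_y)
    for i, p in enumerate(probs):
        if i != label:
            score -= p * _safe_log(1.0 - p)
    return score


def predict_member(score, threshold, member_if_below=True):
    """Threshold attack: flag a sample as a training member based on its score.

    Entropy-style scores are typically lower for members, while confidence
    is typically higher, hence the direction flag.
    """
    return score <= threshold if member_if_below else score >= threshold
```

In a threshold-based attack of this kind, the threshold would usually be tuned on data with known membership; a successfully unlearned forget set should then be flagged as member no more often than held-out data.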
