AMUN: Adversarial Machine UNlearning

Ali Ebrahimpour-Boroojeny, Hari Sundaram, Varun Chandrasekaran

TL;DR

This work tackles privacy-driven machine unlearning by introducing AMUN, which fine-tunes the model on adversarial near-neighbors of the forget samples to reduce its confidence on the forget subset while preserving test accuracy, thereby mimicking retraining on the remaining data. The authors provide a theoretical bound on parameter updates to explain why small Lipschitz constants, tight adversarial proximity, and effective adversarial examples aid unlearning, and they validate AMUN against strong baselines on CIFAR-10 both with and without access to the remaining data. Empirical results show AMUN outperforms prior SOTA unlearning methods, remains effective on adversarially robust models, and scales to multiple sequential unlearning requests, with an ablation study confirming the critical role of the adversarial set. The work advances practical, efficient unlearning with privacy-preserving implications, and opens avenues for extending the approach to other domains and formal privacy guarantees.

Abstract

Machine unlearning, where users can request the deletion of a forget dataset, is becoming increasingly important because of numerous privacy regulations. Initial works on "exact" unlearning (e.g., retraining) incur large computational overheads. However, while computationally inexpensive, "approximate" methods have fallen short of reaching the effectiveness of exact unlearning: the models they produce fail to obtain comparable accuracy and prediction confidence on both the forget and test (i.e., unseen) datasets. Exploiting this observation, we propose a new unlearning method, Adversarial Machine UNlearning (AMUN), that outperforms prior state-of-the-art (SOTA) methods for image classification. AMUN lowers the confidence of the model on the forget samples by fine-tuning the model on their corresponding adversarial examples. Adversarial examples naturally belong to the distribution imposed by the model on the input space; fine-tuning the model on the adversarial examples closest to the corresponding forget samples (a) localizes the changes to the decision boundary of the model around each forget sample and (b) avoids drastic changes to the global behavior of the model, thereby preserving the model's accuracy on test samples. Using AMUN for unlearning a random $10\%$ of CIFAR-10 samples, we observe that even SOTA membership inference attacks cannot do better than random guessing.
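The core idea in the abstract (find the adversarial example closest to a forget sample, then fine-tune on it with its mis-predicted label) can be illustrated on a toy problem. The following is a minimal sketch, not the paper's algorithm: it assumes a two-dimensional logistic-regression model, a simple growing-radius search along the weight direction in place of a real attack, and fine-tuning on the adversarial example alone. All names (`closest_adversarial`, `fine_tune`, `eps0`, `growth`) and all numeric values are hypothetical.

```python
import math

def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def prob_class1(w, b, x):
    return 1.0 / (1.0 + math.exp(-logit(w, b, x)))

def predict(w, b, x):
    return 1 if logit(w, b, x) > 0 else 0

def confidence(w, b, x, y):
    """Probability the model assigns to label y at input x."""
    p1 = prob_class1(w, b, x)
    return p1 if y == 1 else 1.0 - p1

def closest_adversarial(w, b, x, eps0=0.05, growth=1.5, max_tries=50):
    """Grow the L2 radius until the prediction flips (toy stand-in for an attack).

    For a linear model the shortest path to the decision boundary is along
    +/- w / ||w||, so we only search over the step size eps.
    """
    y = predict(w, b, x)
    norm = math.sqrt(sum(wi * wi for wi in w))
    sign = -1.0 if y == 1 else 1.0
    eps = eps0
    for _ in range(max_tries):
        x_adv = [xi + sign * eps * wi / norm for xi, wi in zip(x, w)]
        y_adv = predict(w, b, x_adv)
        if y_adv != y:
            return x_adv, y_adv, eps  # y_adv is the mis-predicted label
        eps *= growth
    raise RuntimeError("no adversarial example found within max_tries")

def fine_tune(w, b, samples, lr=0.5, epochs=20):
    """Plain SGD on the logistic loss over (x, y) pairs."""
    w = list(w)
    for _ in range(epochs):
        for x, y in samples:
            err = prob_class1(w, b, x) - y  # d(loss)/d(logit)
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Hypothetical trained model and forget sample.
w0, b0 = [2.0, -1.0], 0.0
x_forget = [1.0, 0.5]
y_forget = predict(w0, b0, x_forget)

x_adv, y_adv, eps = closest_adversarial(w0, b0, x_forget)
w1, b1 = fine_tune(w0, b0, [(x_adv, y_adv)])

conf_before = confidence(w0, b0, x_forget, y_forget)
conf_after = confidence(w1, b1, x_forget, y_forget)
print("adversarial radius:", eps)
print("confidence on forget sample before:", conf_before)
print("confidence on forget sample after: ", conf_after)
```

In this toy setting the fine-tuning step moves the decision boundary toward the forget sample, so the model's confidence in the original label drops. Whether test accuracy is preserved, which is the paper's central claim, cannot be judged from a single-sample example like this one.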

Paper Structure

This paper contains 37 sections, 1 theorem, 6 equations, 7 figures, 9 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=\{1, \dots, N\}}$ be a dataset of $N$ samples and without loss of generality let $(x_n, y_n)$ (henceforth represented as $(x,y)$ for brevity) be the sample that needs to be forgotten and $(x^\prime, y^\prime)$ be its corresponding adversarial example used by AMUN where $C = \ell(f_{\theta_o}(x^\prime), y) + \ell(f_{\theta^\prime}(x^\prime), y^\prime) -\ell(f_{\

Figures (7)

  • Figure 1: Effect of fine-tuning on adversarial examples. This figure shows how the test accuracy of a ResNet-18 model trained on CIFAR-10 changes with the dataset used for fine-tuning (see § \ref{sec:ablation} for details). Let $\mathcal{D_{\text{F}}}$ contain $10\%$ of the samples in $\mathcal{D}$ and let $\mathcal{D_{\text{A}}}$ be the set of adversarial examples constructed using Algorithm \ref{alg:advset}. From the left sub-figure to the right one, Adv shows the results when $\mathcal{D} \cup \mathcal{D_{\text{A}}}$, $\mathcal{D_{\text{F}}} \cup \mathcal{D_{\text{A}}}$, and $\mathcal{D_{\text{A}}}$, respectively, are used for fine-tuning the model. Orig, Adv-RS, Adv-RL, Orig-RL, and Orig-AdvL show the results when $\mathcal{D_{\text{A}}}$ in each of these sub-figures is replaced by $\mathcal{D_{\text{F}}}$, $\mathcal{D_{\text{A}}}_{RS}$, $\mathcal{D_{\text{A}}}_{RL}$, $\mathcal{D}_{RL}$, and $\mathcal{D}_{AdvL}$, respectively. As the figure shows, using adversarial examples with their mis-predicted labels is what preserves the model's test accuracy, because $\mathcal{D_{\text{A}}}$, in contrast to the other constructed datasets, belongs to the natural distribution learned by the trained model.
  • Figure 2: Multiple unlearning requests. This figure shows the evaluation of unlearning methods when they are applied five times in sequence, each time to $2\%$ of the training data. We train a ResNet-18 model on CIFAR-10 when $\mathcal{D_{\text{R}}}$ is available (left) and when it is not (right). After each unlearning step, we use the MIA scores generated by RMIA to derive the area under the ROC curve (AUC) for $\mathcal{D_{\text{R}}}$ vs. $\mathcal{D_{\text{F}}}$ and for $\mathcal{D_{\text{F}}}$ vs. $\mathcal{D_{\text{T}}}$. The values on the y-axis show the difference between these two AUC scores. A high value for this gap means the samples in $\mathcal{D_{\text{F}}}$ are far more similar to $\mathcal{D_{\text{T}}}$ than to $\mathcal{D_{\text{R}}}$, which indicates more effective unlearning.
  • Figure 3: (left) These two plots show histograms of the retrained model's confidence values on its predictions for the remaining set (Remain), test set (Test), and forget set (Forget) during the training, when the size of the forget set is $10\%$ (1st plot) and $50\%$ (2nd plot) of the training set, together with the Gaussian distributions fitted to each histogram. As the plots show, the model performs similarly on the forget set and the test set because, to the retrained model, both are unseen data from the same distribution. (right) This plot compares the $\delta_x$ value in Definition \ref{def:attack} for adversarial examples generated on the original ResNet-18 models (x-axis) and clipped ResNet-18 models (y-axis). The dashed line is the line $x=y$; more than $97\%$ of the values fall below it.
  • Figure 4: The two left-most subplots show the confidence values before and after unlearning (using AMUN) of $10\%$ of the training samples. The two right-most subplots show these confidence values for unlearning $50\%$ of the training samples. In both cases, before unlearning the confidence values of samples in $\mathcal{D_{\text{F}}}$ are similar to those of $\mathcal{D_{\text{R}}}$ and their fitted Gaussian distributions match, as expected. After using AMUN to unlearn the samples in $\mathcal{D_{\text{F}}}$, the confidence values on this set become more similar to those of the test (unseen) samples.
  • Figure 5: This figure shows how the test accuracy of a ResNet-18 model trained on CIFAR-10 changes with the dataset used for fine-tuning (see § \ref{sec:ablation} for details). Let $\mathcal{D_{\text{F}}}$ contain $50\%$ of the samples in $\mathcal{D}$ and let $\mathcal{D_{\text{A}}}$ be the set of adversarial examples constructed using Algorithm \ref{alg:advset}. From the left sub-figure to the right one, Adv shows the results when $\mathcal{D} \cup \mathcal{D_{\text{A}}}$, $\mathcal{D_{\text{F}}} \cup \mathcal{D_{\text{A}}}$, and $\mathcal{D_{\text{A}}}$, respectively, are used for fine-tuning the model. Orig, Adv-RS, Adv-RL, Orig-RL, and Orig-AdvL show the results when $\mathcal{D_{\text{A}}}$ in each of these sub-figures is replaced by $\mathcal{D_{\text{F}}}$, $\mathcal{D_{\text{A}}}_{RS}$, $\mathcal{D_{\text{A}}}_{RL}$, $\mathcal{D}_{RL}$, and $\mathcal{D}_{AdvL}$, respectively. As the figure shows, using adversarial examples with their mis-predicted labels is what preserves the model's test accuracy, because $\mathcal{D_{\text{A}}}$, in contrast to the other constructed datasets, belongs to the natural distribution learned by the trained model.
  • ...and 2 more figures
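
The AUC gap plotted in Figure 2 can be sketched in a few lines. This is an illustrative reading of the caption, not the paper's evaluation code: it assumes that higher MIA scores mean "more likely a training member", that the gap is AUC($\mathcal{D_{\text{R}}}$ vs. $\mathcal{D_{\text{F}}}$) minus AUC($\mathcal{D_{\text{F}}}$ vs. $\mathcal{D_{\text{T}}}$), and that the first set in each pair is treated as the positive class. The scores below are made up for illustration and are not RMIA outputs.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def auc_gap(remain_scores, forget_scores, test_scores):
    """AUC(remain vs. forget) minus AUC(forget vs. test); sign convention is an assumption."""
    return auc(remain_scores, forget_scores) - auc(forget_scores, test_scores)

# Hypothetical membership scores after an unlearning step.
remain = [0.90, 0.80, 0.85]   # still look like training members
forget = [0.40, 0.50, 0.45]   # should now look like unseen data
test = [0.45, 0.40, 0.50]     # genuinely unseen data

gap = auc_gap(remain, forget, test)
print("AUC(remain vs. forget):", auc(remain, forget))
print("AUC(forget vs. test):  ", auc(forget, test))
print("gap:", gap)
```

Under this convention, a gap near its maximum means the attack separates remaining from forget samples easily while being unable to tell forget samples from test samples, which matches the caption's description of effective unlearning.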

Theorems & Definitions (4)

  • Definition 2.1: Attack Algorithm
  • Definition 2.2: Machine Unlearning
  • Theorem 4.1
  • proof