Machine Unlearning Fails to Remove Data Poisoning Attacks

Martin Pawelczyk, Jimmy Z. Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, Seth Neel

TL;DR

This work critically examines how effectively practical machine unlearning methods erase the effects of data poisoning in large-scale models. By evaluating eight unlearning algorithms on vision (e.g., ResNet-18 on CIFAR-10) and language (e.g., GPT-2 on IMDb) tasks across indiscriminate, targeted, backdoor, and a newly introduced Gaussian poisoning attack, the study reveals that none of them matches full retraining at removing poisons, even with substantial compute budgets. The authors introduce Gaussian poisoning and the Gaussian Unlearning Score to provide a scalable, cross-domain evaluation of unlearning efficacy, and show that standard membership inference attacks (MIAs) can be misleading. Two hypotheses are proposed to explain the failures: poisons cause large model shifts and induce updates in a subspace orthogonal to that of clean data, making gradient-based unlearning ineffective. The findings call for more rigorous evaluations and provable guarantees for unlearning methods to ensure privacy and data integrity in practice.

Abstract

We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of poisoned data. We experimentally demonstrate that, while existing unlearning methods have been shown to be effective in a number of settings, they fail to remove the effects of data poisoning across a variety of poisoning attacks (indiscriminate, targeted, and a newly introduced Gaussian poisoning attack) and models (image classifiers and LLMs), even when granted a relatively large compute budget. To precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, is required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful for efficiently removing poisoned data without having to retrain, our work suggests that these methods are not yet "ready for prime time," and currently provide limited benefit over retraining.

Paper Structure

This paper contains 43 sections, 14 equations, 11 figures, 7 tables, and 4 algorithms.

Figures (11)

  • Figure 1: Standard MIA evaluations are insufficient for detecting unlearning violations. Left: At a low false positive rate (FPR = 0.01), standard MIAs have low true positive rates, making them ineffective at identifying whether a targeted sample was successfully unlearned. Right: Our proposed Gaussian poison attack achieves a higher true positive rate at the same FPR, improving the detection of unlearning failures. A full trade-off curve comparison is provided in Figure \ref{fig:standard_mia_fails}.
  • Figure 2: Unlearning fails to remove Gaussian poisons across a variety of unlearning methods. We poison 1.5% of the training data by adding Gaussian noise with variance $\varepsilon_{p,\text{IMDb}}^2 = 0.1$ and $\varepsilon_{p,\text{CIFAR-10}}^2 = 0.32$, respectively. We train a ResNet-18 for 100 epochs and finetune a GPT-2 for 10 epochs on the respective poisoned training sets. Finally, we use 10% of the original compute budget (10 epochs for CIFAR-10 or 1 epoch for IMDb) to unlearn the poisoned points; a construction sketch for these poisons follows the figure list. None of the unlearning methods removes the poisoned points, as the orange vertical bars do not match the dashed black retraining benchmark.
  • Figure 3: Unlearning fails to remove targeted and backdoor poisons across a variety of unlearning methods. We poison 1.5% of the training data by adding Witch's Brew poisons [GeipingFHCTMG21] to a ResNet-18 trained on CIFAR-10 or instruction poisons [wan2023poisoning] to a GPT-2 finetuned on IMDb. We then train the ResNet-18 for 100 epochs and finetune the GPT-2 for 10 epochs on the respective poisoned training sets. In both cases, we use roughly $1/10$ of the original compute budget (10 epochs for CIFAR-10 or 1 epoch for IMDb) to unlearn the poisoned points. None of the considered methods removes the poisoned points.
  • Figure 4: The dot product between normalized clean input gradients and Gaussian samples/poisons is again Gaussian distributed. We test whether unlearning with NGD at $\sigma_{\text{NGD}}^2 = 10^{-7}$ was successful for a ResNet-18 trained on CIFAR-10, where $\xi \sim \mathcal{N}(0, \varepsilon_p^2 \cdot \mathbb{I}_d)$ with $\varepsilon_p^2 = 0.32$ was added to a subset of 750 training points (1.5% of the train set) targeted for unlearning. Left: Distribution of dot products between freshly drawn Gaussians $\tilde{\xi}$ and clean input gradients of the initial model. Middle: Distribution of dot products between the model's poisons $\xi$ and clean input gradients of the initial model. Right: Distribution of dot products between the model's poisons $\xi$ and clean input gradients of the updated model. The columns show that the dot-product statistic is again Gaussian distributed with $\hat{\sigma}^2 \approx 1$ and a mean $\hat{\mu}$ that depends on whether the poison is statistically dependent on the input gradients $\nabla_\mathbf{x} \ell_{\theta_{\text{initial}}}(\mathbf{x})$ or $\nabla_\mathbf{x} \ell_{\theta_{\text{updated}}}(\mathbf{x})$. Comparing the leftmost column to the middle and right columns shows that the test distinguishes Gaussians $\tilde{\xi}$ that are independent of the model (left panel: the brown histogram matches the standard normal density) from poisons $\xi$ that are dependent on the model because they were included in training (middle and right panels: the orange and blue histograms match mean-shifted Gaussians); a sketch of this score computation is given after the figure list.
  • Figure 5: Empirical tradeoff curves (solid) match analytical Gaussian tradeoff curves (dashed). We plot the empirical tradeoff curves before and after unlearning the poisons with NGD at $\sigma^2_{\text{NGD}} = 10^{-7}$. Next to each empirical tradeoff curve (solid), we plot the analytical Gaussian tradeoff curve $G_{\mu} = 1 - \Phi(\Phi^{-1}(1-\text{FPR}) - \mu)$ [dong2022gaussian, leemann2024gaussian], where $\Phi$ denotes the CDF of the standard normal distribution; the match between the empirical and analytical curves is excellent (see the small computational sketch after the figure list). Since the orange and blue solid tradeoff curves are far from the diagonal, which would indicate only a random-guessing chance of distinguishing the model's noise $\xi$ from a freshly drawn Gaussian $\tilde{\xi}$, unlearning was not successful.
  • ...and 6 more figures
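For concreteness, here is a minimal sketch of how the Gaussian poisons described in Figure 2 could be constructed, assuming PyTorch-style input tensors; the function name and interface are illustrative and not taken from the paper's code. Each selected training input receives additive noise $\xi \sim \mathcal{N}(0, \varepsilon_p^2 \cdot \mathbb{I}_d)$, and the noise is kept so the unlearning test can be run later.

```python
import torch

def add_gaussian_poisons(inputs, poison_frac=0.015, eps_sq=0.32, seed=0):
    """Add xi ~ N(0, eps_sq * I) to a random subset of training inputs.

    inputs:      tensor of shape (n, ...) with the clean training inputs
    poison_frac: fraction of points to poison (1.5% in the paper's setup)
    eps_sq:      poison variance eps_p^2 (0.32 for CIFAR-10, 0.1 for IMDb)
    Returns the poisoned inputs, the poisoned indices, and the noise xi,
    which is stored so the Gaussian Unlearning Score can be evaluated later.
    """
    g = torch.Generator().manual_seed(seed)
    n = inputs.shape[0]
    n_poison = int(poison_frac * n)
    idx = torch.randperm(n, generator=g)[:n_poison]
    xi = torch.randn(inputs[idx].shape, generator=g) * eps_sq ** 0.5
    poisoned = inputs.clone()
    poisoned[idx] += xi
    return poisoned, idx, xi
```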
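Similarly, the dot-product statistic behind Figure 4 can be sketched as follows. This is a reconstruction from the caption, not the authors' reference implementation: the unit-norm gradient normalization and the rescaling by $1/\varepsilon_p$ are assumptions chosen so that noise independent of the model yields scores that are approximately standard normal.

```python
import torch

def gaussian_unlearning_scores(model, loss_fn, clean_inputs, labels, xi, eps_sq):
    """Dot products between poisons xi and normalized clean input gradients.

    clean_inputs: clean versions of the poisoned inputs, shape (m, ...)
    labels:       tensor of class indices for those inputs
    xi:           the stored Gaussian noise added to each poisoned point
    For each point, compute the input gradient of the loss on the *clean*
    input, normalize it to unit norm, and return <xi_i, g_i> / eps_p.
    If xi is independent of the model, scores should be ~ N(0, 1); if the
    model still carries the poison, the distribution is mean-shifted (Figure 4).
    """
    scores = []
    eps = eps_sq ** 0.5
    for x, y, noise in zip(clean_inputs, labels, xi):
        x = x.unsqueeze(0).clone().requires_grad_(True)
        loss = loss_fn(model(x), y.unsqueeze(0))
        (grad,) = torch.autograd.grad(loss, x)
        grad = grad / grad.norm()
        scores.append(torch.dot(noise.flatten(), grad.flatten()) / eps)
    return torch.stack(scores)
```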
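Finally, the dashed analytical curves in Figure 5 are instances of the Gaussian tradeoff $G_{\mu}$ from the caption. A small NumPy/SciPy sketch is below; the values of $\mu$ are illustrative placeholders, whereas in practice $\mu$ would be estimated from the unlearning scores (e.g., as their mean).

```python
import numpy as np
from scipy.stats import norm

def gaussian_tradeoff(fpr, mu):
    """Analytical Gaussian tradeoff curve: G_mu(FPR) = 1 - Phi(Phi^{-1}(1 - FPR) - mu).

    mu = 0 gives the diagonal (random guessing), i.e., successful unlearning;
    larger mu means the poisons remain easy to distinguish from fresh noise.
    """
    return 1.0 - norm.cdf(norm.ppf(1.0 - fpr) - mu)

fpr = np.linspace(1e-4, 1.0, 200)
tpr_before = gaussian_tradeoff(fpr, mu=2.0)  # illustrative shift before unlearning
tpr_after = gaussian_tradeoff(fpr, mu=1.8)   # illustrative shift after unlearning
```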