Machine Unlearning Fails to Remove Data Poisoning Attacks
Martin Pawelczyk, Jimmy Z. Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, Seth Neel
TL;DR
This work critically examines the effectiveness of practical machine unlearning methods at erasing the effects of data poisoning in large-scale models. Evaluating eight unlearning algorithms on vision (e.g., ResNet-18 on CIFAR-10) and language (e.g., GPT-2 on IMDb) tasks across indiscriminate, targeted, backdoor, and a newly introduced Gaussian poisoning attack, the study finds that none matches full retraining in removing the poisons, even when granted a substantial compute budget. The authors introduce the Gaussian poisoning attack and the accompanying Gaussian Unlearning Score to provide a scalable, cross-domain evaluation of unlearning efficacy, and show that standard membership inference attacks (MIAs) can give a misleading picture. Two hypotheses are proposed to explain these failures: poisoned data shifts the model far from the retrained solution, and poison-induced updates lie in a subspace largely orthogonal to the clean-data gradients, which renders gradient-based unlearning ineffective. The findings call for more rigorous evaluations and provable guarantees for unlearning methods to ensure privacy and data integrity in practice.
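To make the Gaussian poisoning idea concrete, below is a minimal sketch, not the authors' implementation, of how such an attack and a correlation-based unlearning score could be instantiated for a generic PyTorch classifier. The function names (`gaussian_poison`, `gaussian_unlearning_score`), the noise scale `sigma`, and the use of input gradients as the test statistic are illustrative assumptions; the paper's exact Gaussian Unlearning Score may differ in its details.

```python
# Illustrative sketch only (not the authors' code): Gaussian poisoning adds
# i.i.d. noise to training inputs; after (un)learning, we test whether the
# model's gradients still correlate with the injected noise.
import torch
import torch.nn.functional as F


def gaussian_poison(x, sigma=0.1):
    """Add i.i.d. Gaussian noise to a batch of inputs.

    Returns the poisoned inputs and the noise itself, which the auditor keeps
    as a secret key for the later correlation test.
    """
    noise = sigma * torch.randn_like(x)
    return x + noise, noise


def gaussian_unlearning_score(model, x, y, noise):
    """Correlate per-sample input gradients with the injected noise.

    If the poisoned points were truly unlearned, the gradients should carry no
    trace of the noise and the normalized inner products should concentrate
    around zero; residual correlation indicates the poisons' influence remains.
    """
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)
    g = grad.flatten(1)
    n = noise.flatten(1)
    corr = (g * n).sum(dim=1) / (g.norm(dim=1) * n.norm(dim=1) + 1e-12)
    return corr.mean().item()
```

In this sketch, a score near zero after unlearning would be consistent with successful removal, while a score comparable to that of the poisoned model would suggest the unlearning method failed to erase the poisons' effect.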
Abstract
We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of poisoned data. We experimentally demonstrate that, while existing unlearning methods have been shown to be effective in a number of settings, they fail to remove the effects of data poisoning across a variety of types of poisoning attacks (indiscriminate, targeted, and a newly introduced Gaussian poisoning attack) and models (image classifiers and LLMs), even when granted a relatively large compute budget. In order to precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, is required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful for efficiently removing poisoned data without having to retrain, our work suggests that these methods are not yet "ready for prime time," and currently provide limited benefit over retraining.
