Hidden Poison: Machine Unlearning Enables Camouflaged Poisoning Attacks
Jimmy Z. Di, Jack Douglas, Jayadev Acharya, Gautam Kamath, Ayush Sekhari
TL;DR
The paper studies camouflaged data poisoning attacks that exploit machine unlearning. The adversary inserts a small poison set $S_{po}$ and a camouflage set $S_{ca}$ into the training data, so that a later unlearning request deleting $S_{ca}$ exposes the poison and causes a target test input $x_{target}$, whose true label is $y_{target}$, to be misclassified as $y_{adversarial}$. The paper proposes a gradient-matching framework that efficiently generates poisons (and camouflages) by aligning gradients computed with respect to a fixed clean model $\theta_{cl}$. It implements two camouflage strategies, label flipping for simple cases and gradient matching for general multiclass settings, and details the implementation and optimization steps. The method is evaluated on CIFAR-10, Imagenette, and Imagewoof using SVMs and deep networks. It achieves substantial camouflaging success under multiple threat models ($\varepsilon$, $b_p$, $b_c$) while keeping overall validation accuracy stable, and it is robust to data augmentation and transfers across models. The paper also explores approximate unlearning (Amnesiac) and multi-target scenarios. The results show that machine unlearning can introduce a new, timing-sensitive vulnerability in which an attacker triggers a misclassification at a chosen moment, underscoring the need for defenses and for further research into detection, purification, and robust unlearning techniques.
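To make the gradient-matching idea concrete, here is a minimal sketch of one way to score how well a set of poison points aligns with the adversarial objective at a fixed clean model. It is an illustration under stated assumptions, not the paper's implementation: the summary does not give the exact objective, so the cosine-similarity loss, the binary logistic model, and the names `logistic_grad`, `gradient_matching_loss`, `X_base`, `y_base`, and `delta` are all hypothetical choices made for this example.

```python
import numpy as np

def logistic_grad(theta, x, y):
    """Gradient w.r.t. theta of the logistic loss log(1 + exp(-y * <theta, x>)), y in {-1, +1}."""
    margin = y * np.dot(theta, x)
    return -y * x / (1.0 + np.exp(margin))

def gradient_matching_loss(theta_cl, x_target, y_adv, X_base, y_base, delta):
    """Return 1 - cosine similarity between two gradients taken at the fixed model theta_cl.

    Assumed objective (not confirmed by the summary):
      g_target: gradient of the loss on the target input with the adversarial label
      g_poison: summed gradients of the loss on the perturbed points X_base[i] + delta[i]
    A value near 0 means the poison gradients point the same way as the target gradient.
    """
    g_target = logistic_grad(theta_cl, x_target, y_adv)
    g_poison = sum(
        logistic_grad(theta_cl, x + d, y) for x, d, y in zip(X_base, delta, y_base)
    )
    # Small constant guards against division by zero for degenerate gradients.
    denom = np.linalg.norm(g_target) * np.linalg.norm(g_poison) + 1e-12
    return 1.0 - np.dot(g_target, g_poison) / denom
```

In an attack built on this sketch, `delta` would be optimized to reduce the loss while each perturbation stays within the $\varepsilon$ bound of the threat model, for example by clipping with `np.clip(delta, -eps, eps)` after each step. Reusing the same function with $y_{target}$ in place of the adversarial label is one plausible reading of how camouflages are generated, but that reading is also an assumption.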
Abstract
We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings where model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal. The adversary subsequently triggers a request to remove a subset of the introduced points, at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset.
