Hidden Poison: Machine Unlearning Enables Camouflaged Poisoning Attacks

Jimmy Z. Di, Jack Douglas, Jayadev Acharya, Gautam Kamath, Ayush Sekhari

TL;DR

The paper studies camouflaged data poisoning attacks that exploit machine unlearning. The adversary inserts a small poison set $S_{po}$ and a camouflage set $S_{ca}$ into the training data; a later unlearning request that deletes $S_{ca}$ exposes the poison and causes a target test input $(x_{target}, y_{target})$ to be misclassified as $y_{adversarial}$. The paper proposes a gradient-matching framework that efficiently generates poisons (and camouflages) by aligning gradients with respect to a fixed clean model $\theta_{cl}$, and implements two camouflage strategies: label flipping for simple cases and gradient matching for general multiclass settings, with detailed implementation and optimization steps. The method is evaluated on CIFAR-10, Imagenette, and Imagewoof using SVMs and deep networks. It achieves substantial camouflaging success under multiple threat models (parameterized by $\varepsilon$, $b_p$, $b_c$) while keeping overall validation accuracy stable, and it is robust to data augmentation and transfers across models; the paper also explores approximate unlearning (Amnesiac) and multi-target scenarios. The results emphasize that machine unlearning can introduce a new, timing-sensitive vulnerability in which an attacker triggers misclassification at a chosen moment, underscoring the need for defenses and further research into detection, purification, and robust unlearning techniques.
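
To make the gradient-alignment idea concrete, here is a minimal sketch in pure Python. It is an illustration under stated assumptions, not the paper's implementation: the model is a toy logistic regression rather than a deep network, alignment is measured with cosine similarity (the summary above only says gradients are "aligned"), and the names `logistic_grad`, `cosine_alignment_loss`, and `poison_alignment` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad(theta, x, y):
    """Gradient of the logistic loss at (x, y), with y in {0, 1}, w.r.t. theta."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(p - y) * xi for xi in x]


def cosine_alignment_loss(g_a, g_b):
    """Return 1 - cos(g_a, g_b): 0 when aligned, 2 when opposed.

    Assumes neither gradient is the zero vector.
    """
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    return 1.0 - dot / (norm_a * norm_b)


def poison_alignment(theta_cl, x_target, y_adversarial, poisons):
    """Alignment loss between the adversarial target gradient and the
    summed gradient of the poison points, both taken at the fixed
    clean model theta_cl (assumed objective, for illustration only)."""
    g_target = logistic_grad(theta_cl, x_target, y_adversarial)
    g_poison = [0.0] * len(theta_cl)
    for x, y in poisons:
        for i, g in enumerate(logistic_grad(theta_cl, x, y)):
            g_poison[i] += g
    return cosine_alignment_loss(g_target, g_poison)
```

In this sketch, an attacker would adjust the poison inputs to push the loss toward 0, so that training on the poisons moves the model in roughly the same direction as training on the target with the adversarial label. The same loss could plausibly be reused for camouflages by matching against the gradient for the correct target label, though that reading is an assumption here.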

Abstract

We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings when model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal. The adversary subsequently triggers a request to remove a subset of the introduced points, at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset.

Paper Structure

This paper contains 40 sections, 9 equations, 11 figures, 15 tables, and 2 algorithms.

Figures (11)

  • Figure 1: An illustration of a successful camouflaged targeted data poisoning attack. In Step 1, the adversary adds poison and camouflage sets of points to the (clean) training data. In Step 2, the model is trained on the augmented training dataset. It should behave as if it had been trained on only the clean data; in particular, it should correctly classify the targeted point. In Step 3, the adversary triggers an unlearning request to delete the camouflage set from the trained model. In Step 4, the resulting model misclassifies the targeted point.
  • Figure 2: Some representative images from Imagewoof. In each pair, the left figure is from the training dataset, while the right image has been adversarially manipulated. The top and bottom rows are images from the poison and camouflage set, respectively. In all cases, the manipulated images are clean label and nearly indistinguishable from the original image.
  • Figure 3: Efficacy of the proposed camouflaged poisoning attack on the CIFAR-10 dataset. The left plot gives the success for the threat model $\varepsilon = 16, b_p = 0.6\%, b_c=0.6\%$ for different neural network architectures. The right plot gives the success for the ResNet-18 architecture for different threat models.
  • Figure 4: Efficacy of the proposed camouflaged poisoning attack on the CIFAR-10 dataset. The left plot gives the success for the threat model $\varepsilon = 16, b_p = 0.6\%, b_c=0.6\%$ for different neural network architectures. The right plot gives the success for the ResNet-18 architecture for different threat models.
  • Figure 5: Visualization of poisons and camouflages on the Imagewoof dataset. The first and third columns show the original images, and the second and fourth columns show the corrupted images (with added $\Delta$). The shown images were generated for a camouflaged poisoning attack on ResNet-18, with Seed = 2111111110, $b_p = b_c = 4.2\%$, $\varepsilon=16$. The target and camouflage class is Australian Terrier, and the poison class is Golden Retriever.
  • ...and 6 more figures
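
The threat-model parameters in the captions above can be read as simple budgets. The sketch below is one hedged interpretation, not taken from the paper text shown here: it assumes $\varepsilon$ is an $\ell_\infty$ bound on the perturbation $\Delta$ measured on the 0 to 255 pixel scale, and that $b_p$ and $b_c$ are fractions of the training set that the adversary may use for poisons and camouflages. The helper names `budget_size` and `project_perturbation` are hypothetical.

```python
def budget_size(n_train, fraction):
    """Number of training points covered by a budget fraction
    (for example, 0.006 for a 0.6% budget)."""
    return int(round(n_train * fraction))


def project_perturbation(pixels, delta, eps=16):
    """Apply delta to pixels under an assumed l-infinity budget.

    Each perturbation entry is clipped to [-eps, eps], then the
    perturbed pixel is clipped to the valid range [0, 255].
    """
    out = []
    for p, d in zip(pixels, delta):
        d = max(-eps, min(eps, d))
        out.append(max(0, min(255, p + d)))
    return out


# CIFAR-10 has 50,000 training images, so a 0.6% budget is 300 images.
n_poison = budget_size(50_000, 0.006)
bounded = project_perturbation([0, 100, 250], [-20, 40, 10])
```

Under these assumptions, $b_p = b_c = 0.6\%$ on CIFAR-10 corresponds to 300 poison and 300 camouflage images, each differing from its original by at most 16 intensity levels per pixel.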