Potion: Towards Poison Unlearning

Stefan Schoepf; Jack Foster; Alexandra Brintrup

TL;DR

This paper introduces a novel outlier-resistant method, based on Selective Synaptic Dampening (SSD), that significantly improves model protection and unlearning performance. It also introduces Poison Trigger Neutralisation (PTN) search, a fast, parallelisable hyperparameter search that utilises the characteristic "unlearning versus model protection" trade-off to find suitable hyperparameters in settings where the forget set size is unknown and the retain set is contaminated.

Abstract

Adversarial attacks by malicious actors on machine learning systems, such as introducing poison triggers into training datasets, pose significant risks. The challenge in resolving such an attack arises in practice when only a subset of the poisoned data can be identified. This necessitates the development of methods to remove, i.e. unlearn, poison triggers from already trained models with only a subset of the poison data available. The requirements for this task significantly deviate from privacy-focused unlearning where all of the data to be forgotten by the model is known. Previous work has shown that the undiscovered poisoned samples lead to a failure of established unlearning methods, with only one method, Selective Synaptic Dampening (SSD), showing limited success. Even full retraining, after the removal of the identified poison, cannot address this challenge as the undiscovered poison samples lead to a reintroduction of the poison trigger in the model. Our work addresses two key challenges to advance the state of the art in poison unlearning. First, we introduce a novel outlier-resistant method, based on SSD, that significantly improves model protection and unlearning performance. Second, we introduce Poison Trigger Neutralisation (PTN) search, a fast, parallelisable, hyperparameter search that utilises the characteristic "unlearning versus model protection" trade-off to find suitable hyperparameters in settings where the forget set size is unknown and the retain set is contaminated. We benchmark our contributions using ResNet-9 on CIFAR10 and WideResNet-28x10 on CIFAR100. Experimental results show that our method heals 93.72% of poison compared to SSD with 83.41% and full retraining with 40.68%. We achieve this while also lowering the average model accuracy drop caused by unlearning from 5.68% (SSD) to 1.41% (ours).
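The abstract does not spell out how PTN search works, so the following is only a minimal sketch of one way to exploit an unlearning-versus-protection trade-off. It increases an unlearning strength step by step and stops once accuracy on the identified poison samples, scored against their poisoned labels, falls below a chosen fraction of its starting value. The function `threshold_search`, the parameters `rho`, `start`, `step_factor` and `max_steps`, and the toy `toy_evaluate` callback are all hypothetical and not taken from the paper.

```python
from typing import Callable, Tuple


def threshold_search(
    evaluate: Callable[[float], float],
    rho: float = 0.2,
    start: float = 0.1,
    step_factor: float = 2.0,
    max_steps: int = 20,
) -> Tuple[float, float]:
    """Return the first strength whose poison accuracy is <= rho * baseline.

    evaluate(strength) is assumed to unlearn a fresh copy of the trained
    model with the given strength and return accuracy on the identified
    poison samples, scored against their poisoned labels (hypothetical
    interface, not the paper's API).
    """
    baseline = evaluate(0.0)  # strength 0 means no unlearning applied
    target = rho * baseline
    strength = start
    for _ in range(max_steps):
        accuracy = evaluate(strength)
        if accuracy <= target:
            # Trigger is sufficiently neutralised; stop here so that the
            # model is not modified more than necessary.
            return strength, accuracy
        strength *= step_factor
    raise RuntimeError("no strength met the target within max_steps")


def toy_evaluate(strength: float) -> float:
    """Toy stand-in: poison accuracy decays linearly as strength grows."""
    return max(0.0, 0.95 - 0.1 * strength)


best_strength, poison_accuracy = threshold_search(toy_evaluate)
```

Because every candidate strength is evaluated from the same trained model in this sketch, the candidates are independent of each other and could in principle be evaluated concurrently, which is one way a search of this kind can be parallelised.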

Paper Structure (20 sections, 7 equations, 19 figures, 9 tables, 1 algorithm)

Introduction
Related work & background
Problem setting and notation
Differences from privacy-oriented unlearning causing method failure
SSD-based methods
Hyperparameter selection in poison unlearning
Proposed method
Outlier resistant parameter importance estimation with XLF
Hyperparameter search for poison contaminated data with PTN
Experimental setup
Goel et al. (2024) benchmarks
Additional benchmarks
Results and discussion
Poison unlearning and model protection
Ablation results
...and 5 more sections

Figures (19)

Figure 1: Adapted from Goel et al. (2024). In traditional unlearning tasks, retraining from scratch is the gold standard. In the data poisoning setting, however, retraining fails because the unidentified poison remaining in the training data reintroduces the poison trigger into the new model.
Figure 2: Illustrative example of an adversary introducing a poison trigger (bottom-right corner) to steer a model into dangerous behaviour when detecting a stop sign.
Figure 3: Heavy tails of near-zero importance values increase the risk that noisy importance estimates produce wrongly high relative parameter importances. This in turn leads to the modification of parameters that are essential for the model's performance and thus causes model damage.
Figure 4: Min-max scaled distribution comparison of squaring and not squaring the $l_2$ norm of the model output for importance estimation, as denoted by $l^2_2(f(x_k,\theta))$ in Eq. \ref{eq:approx}.
Figure 5: Min-max scaled distribution comparison of taking the square root ($l_2^{0.5}$) and cubing ($l_2^{3}$) the $l_2$ norm of the model output for importance estimation.
...and 14 more figures
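
The captions for Figures 4 and 5 compare different powers of the $l_2$ norm of the model output as the quantity used for importance estimation. As a minimal sketch, assume that a parameter's importance is the mean squared gradient of $\|f(x_k,\theta)\|_2^p$ with respect to that parameter. This is an assumption about the equation the captions refer to, which is not reproduced here. The example uses a toy linear model $f(x, W) = Wx$, whose gradient is available in closed form; the names `importance` and `p` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def importance(W: Matrix, xs: List[List[float]], p: float) -> Matrix:
    """Mean squared gradient of ||W x||_2 ** p with respect to each W[i][j]."""
    rows, cols = len(W), len(W[0])
    total = [[0.0] * cols for _ in range(rows)]
    for x in xs:
        f = [sum(w * v for w, v in zip(row, x)) for row in W]  # f = W x
        norm = math.sqrt(sum(v * v for v in f))
        if norm == 0.0:
            continue  # skip samples with zero output; they contribute nothing
        # Chain rule: d(norm**p)/dW[i][j] = p * norm**(p - 2) * f[i] * x[j]
        scale = p * norm ** (p - 2)
        for i in range(rows):
            for j in range(cols):
                grad = scale * f[i] * x[j]
                total[i][j] += grad * grad
    return [[t / len(xs) for t in row] for row in total]


W = [[1.0, 0.0], [0.0, 2.0]]
xs = [[3.0, 0.0], [0.0, 1.0]]
imp_squared = importance(W, xs, p=2.0)  # squared l2 norm of the output
imp_plain = importance(W, xs, p=1.0)    # un-squared l2 norm of the output
```

In this toy case the squared norm gives non-zero importances of 162 and 8, while the un-squared norm gives 4.5 and 0.5, so the choice of exponent changes how widely the importance values are spread.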