Delta-Influence: Unlearning Poisons via Influence Functions

Wenjie Li, Jiawei Li, Pengcheng Zeng, Christian Schroeder de Witt, Ameya Prabhu, Amartya Sanyal

TL;DR

This work tackles data poisoning by reframing unlearning as a forensic attribution problem: given an affected test example, identify a small set of training samples whose removal eliminates the attack. The authors introduce Delta-Influence, which tracks how a training point's influence on a poisoned test point changes under test-time transformations, exploiting a phenomenon they call influence collapse to flag poisoned data. Evaluated on three attacks (Frequency Trigger, Witches' Brew, BadNet) across CIFAR-10/100 and Imagenette with ResNet-18, Delta-Influence consistently achieves superior unlearning performance with minimal accuracy loss, outperforming five detection baselines and five unlearning methods. The method is shown to be robust across settings and scalable, and the authors provide public code to facilitate adoption and further research.

Abstract

Addressing data integrity challenges, such as unlearning the effects of data poisoning after model training, is necessary for the reliable deployment of machine learning models. State-of-the-art influence functions, such as EK-FAC and TRAK, often fail to accurately attribute abnormal model behavior to the specific poisoned training data responsible for the data poisoning attack. In addition, traditional unlearning algorithms often struggle to effectively remove the influence of poisoned samples, particularly when only a few affected examples can be identified. To address these challenges, we introduce $\Delta$-Influence, a novel approach that leverages influence functions to trace abnormal model behavior back to the responsible poisoned training data using as few as one poisoned test example. $\Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points without significantly affecting clean data. This allows $\Delta$-Influence to detect large negative shifts in influence scores following data transformations, a phenomenon we term influence collapse, thereby accurately identifying poisoned training data. Unlearning this subset, e.g., through retraining, effectively eliminates the data poisoning. We validate our method across three vision-based poisoning attacks and three datasets, benchmarking against five detection algorithms and five unlearning strategies. We show that $\Delta$-Influence consistently achieves the best unlearning across all settings, demonstrating the promise of influence functions for corrective unlearning. Our code is publicly available at: https://github.com/Ruby-a07/delta-influence
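The detection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes influence scores have already been computed by some influence estimator, and the function name `flag_influence_collapse`, the parameters `drop_threshold` and `min_votes`, and the voting rule across transformations are all hypothetical choices made for illustration. The toy scores are synthetic.

```python
import numpy as np


def flag_influence_collapse(infl_original, infl_transformed, drop_threshold, min_votes):
    """Flag training points whose influence drops sharply after transformations.

    infl_original:    shape (n_train,), influence of each training point on the
                      affected test point (assumed precomputed).
    infl_transformed: shape (n_transforms, n_train), influence of each training
                      point on each transformed copy of that test point.
    drop_threshold:   hypothetical size of drop that counts as a collapse.
    min_votes:        hypothetical number of transformations that must agree.
    """
    infl_original = np.asarray(infl_original, dtype=float)
    infl_transformed = np.asarray(infl_transformed, dtype=float)
    # Influence change per transformation and training point; negative = drop.
    delta = infl_transformed - infl_original[None, :]
    # Count, per training point, how many transformations show a large drop.
    votes = (delta < -drop_threshold).sum(axis=0)
    return np.flatnonzero(votes >= min_votes), delta


# Synthetic toy data: 5 "poisoned" points with high influence that collapses.
rng = np.random.default_rng(0)
n_train, n_transforms = 1000, 4
poison_idx = np.arange(5)
infl_original = rng.normal(0.0, 0.1, n_train)
infl_original[poison_idx] += 5.0
infl_transformed = infl_original[None, :] + rng.normal(0.0, 0.1, (n_transforms, n_train))
infl_transformed[:, poison_idx] -= 5.0

flagged, delta = flag_influence_collapse(
    infl_original, infl_transformed, drop_threshold=1.0, min_votes=3
)
print(flagged)
```

In this sketch the flagged indices would then be removed from the training set before retraining; how the threshold and vote count should be set in practice is not specified by the abstract.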

Paper Structure

This paper contains 26 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Given an affected test point, our goal is to identify the training points responsible for the poisoning, so that retraining without these points can remove the attack from the model. State-of-the-art methods like EK-FAC (grosse_studying_2023) detect only a few poisoned points with low precision, leaving the poisoning effect in the model and causing a large accuracy drop. Our method, $\Delta$-Influence, outperforms existing approaches by successfully recovering the clean model without sacrificing accuracy.
  • Figure 2: We show the Influence Score Change ($\Delta\mathrm{Infl}(i, j)$) for 125 poisoned training points (orange) and 49,875 clean training points (light blue) on the Smooth Trigger attack with CIFAR100. Each plot shows the influence score change for a different transformation applied to the affected test image. The results show a consistent drop in influence scores for all poisoned examples after transformation, while clean examples exhibit no clear trend.
  • Figure 3: Poison Success Rate and Test Accuracy. This table shows both poison unlearning effectiveness and model utility. A method is considered successful if the poison success rate is below 5%, marked by ✓, with unsuccessful methods marked by $\times$. $\Delta$-Influence is successful in 6/6 cases, while the closest competitors succeed in only 3/6. Additionally, $\Delta$-Influence nearly perfectly preserves test accuracy. Figure structure from (pawelczyk_machine_2024).
  • Figure 4: Poison Success Rate and Test Accuracy for Unlearning Methods Applied on Samples Identified by $\Delta$-Influence. Catastrophic Forgetting (CF) and Exact Unlearning (EU) from (goel_2023_adversarial) perform best, effectively unlearning poisoned samples while maintaining test accuracy. In contrast, SSD (foster_2023_fast) and SCRUB (kurmanji_2023_towards) struggle with false negatives, leading to significant accuracy drops, while BadT (chundawat_2023_can) fails to unlearn effectively. We recommend EU or CF as strong baselines and highlight the need for future methods to improve robustness against false positives.
  • Figure 5: Scaling to Imagenette. In the top row, results on Imagenette are consistent with previous findings: $\Delta$-Influence effectively unlearns all three types of poisons while preserving high test accuracy. In contrast, other detection methods often fail to unlearn or do so at the expense of test accuracy. In the bottom row, EU and CF consistently perform well.
  • ...and 2 more figures
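
The success criterion in the Figure 3 caption (poison success rate below 5%) can be written as a short check. This is a sketch under stated assumptions: the caption does not define how the poison success rate is computed, so the code assumes it is the fraction of poisoned test inputs that the model still classifies as the attacker's intended label. The function names are hypothetical.

```python
def poison_success_rate(predictions, target_labels):
    """Fraction of poisoned test inputs predicted as the attacker's target label.

    Assumed definition; both arguments are equal-length sequences of class ids.
    """
    if len(predictions) != len(target_labels):
        raise ValueError("predictions and target_labels must have equal length")
    if len(predictions) == 0:
        raise ValueError("need at least one poisoned test example")
    hits = sum(p == t for p, t in zip(predictions, target_labels))
    return hits / len(predictions)


def unlearning_succeeded(predictions, target_labels, max_rate=0.05):
    """Apply the caption's rule: success means the rate is strictly below 5%."""
    return poison_success_rate(predictions, target_labels) < max_rate


# Toy usage: 2 of 100 poisoned inputs still hit the target class 3.
targets = [3] * 100
preds_after_unlearning = [3] * 2 + [1] * 98
print(unlearning_succeeded(preds_after_unlearning, targets))
```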