Protecting against simultaneous data poisoning attacks

Neel Alex; Shoaib Ahmed Siddiqui; Amartya Sanyal; David Krueger

TL;DR

This work addresses the realistic threat of multiple simultaneous data poisoning attacks. It shows that several backdoors can be inserted into a single model with minimal impact on clean accuracy, and that existing defenses perform poorly in this setting. It introduces BaDLoss, a defense based on loss dynamics that uses a small set of bona fide clean probes to detect and remove anomalous training examples before retraining, with limited degradation to clean accuracy. In experiments on CIFAR-10 and GTSRB, BaDLoss lowers the average attack success rate in the multi-attack setting to 7.98% and 10.29%, respectively, compared with averages of 64.48% and 84.28% across other defenses, while remaining effective in single-attack scenarios. The work highlights the importance of evaluating defenses under multi-attack scenarios and suggests that loss-trajectory analysis can provide a robust, adaptable defense framework for complex poisoning landscapes.

Abstract

Current backdoor defense methods are evaluated against a single attack at a time. This is unrealistic, as powerful machine learning systems are trained on large datasets scraped from the internet, which may be attacked multiple times by one or more attackers. We demonstrate that simultaneously executed data poisoning attacks can effectively install multiple backdoors in a single model without substantially degrading clean accuracy. Furthermore, we show that existing backdoor defense methods do not effectively prevent attacks in this setting. Finally, we leverage insights into the nature of backdoor attacks to develop a new defense, BaDLoss, that is effective in the multi-attack setting. With minimal clean accuracy degradation, BaDLoss attains an average attack success rate in the multi-attack setting of 7.98% in CIFAR-10 and 10.29% in GTSRB, compared to the average of other defenses at 64.48% and 84.28% respectively.
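
The two metrics quoted above, clean accuracy and attack success rate, can be made concrete with a small sketch. The abstract does not spell out their definitions, so everything here is an assumption based on common usage: attack success rate is taken as the fraction of triggered test inputs (excluding those whose true class already equals the target) that the model assigns to the attacker's target class, and the multi-attack average is taken as a plain mean over attacks. The function names and toy predictions are hypothetical.

```python
def clean_accuracy(preds, labels):
    """Fraction of clean test inputs classified correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(triggered_preds, true_labels, target_label):
    """Fraction of triggered inputs sent to the target class.

    Assumption: inputs whose true label is already the target are skipped,
    since predicting the target for them says nothing about the backdoor.
    """
    eligible = [p for p, y in zip(triggered_preds, true_labels) if y != target_label]
    return sum(p == target_label for p in eligible) / len(eligible)


def average_asr(per_attack_asr):
    """Assumed aggregation: unweighted mean of per-attack success rates."""
    return sum(per_attack_asr.values()) / len(per_attack_asr)


# Toy predictions, for illustration only (not results from the paper).
labels = [1, 2, 3, 0, 4]
clean_preds = [1, 2, 3, 0, 0]
triggered_preds = [0, 0, 3, 0, 1]

acc = clean_accuracy(clean_preds, labels)
asr = attack_success_rate(triggered_preds, labels, target_label=0)
mean_asr = average_asr({"attack_a": asr, "attack_b": 0.25})
```

Under these assumptions, a defense is judged by how far it pushes the mean attack success rate down while keeping clean accuracy close to that of an undefended model.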

Paper Structure (26 sections, 2 equations, 8 figures, 2 tables)

Introduction
Related Work
Backdoor Attacks
Backdoor Defenses
Multiple Simultaneous Attacks
Threat Model
Attacks Considered
Efficacy of Multiple Attacks
Defenses Considered
Existing Defense Performance
Developing a New Defense
BaDLoss: Backdoor Detection via Loss Dynamics
Results
Multi-Attack Result
Single-Attack Results
...and 11 more sections

Figures (8)

Figure 1: Complete list of attacks considered in this work. (a) checkerboard pattern trigger (Patch) gu2017badnets, (b) single pixel trigger (Single-Pix) gu2017badnets, (c) random noise blending attack (Blend-R) chen2017backdoor, (d) dimple pattern blending attack (Blend-P) liao2018invisible, (e) sinusoid pattern blending attack (Sinusoid) barni2019sinusoid, (f) optimized-trigger attack (Narcissus) zeng2022narcissus, and (g) frequency-domain attack (Frequency) wang2021frequencybackdoor
Figure 2: Evaluation of Multi-attack Setting. Clean accuracy (blue) should be high, attack success rate (red) should be low. When multiple attacks are simultaneously deployed, clean accuracy suffers slightly but is generally preserved at a high level. All attacks are simultaneously learned by the final model. We also plot the attack success rate against single attacks to illustrate positive and negative interference between attacks.
Figure 3: Defense Performance in the Multi-attack Setting. All evaluated defenses demonstrate failures when evaluated in the multi-attack setting. This holds across datasets. The frequency analysis defense method zeng2022frequency achieves low attack success rate on CIFAR-10, but removes so much data that clean accuracy is reduced to random chance levels. Spectral Signatures provides the best defense overall on CIFAR-10, but fails on GTSRB, and dramatically underperforms BaDLoss (see the BaDLoss multi-attack results figure).
Figure 4: BaDLoss Overview. (1) The defender tracks clean examples ("probes") in the training set across multiple short training runs. (2) Every example gets an anomaly score based on its average distance from the bona fide clean examples. The farthest examples are marked as potential backdoors. (3) The defender retrains the model, excluding any examples identified as anomalous. (4) The defender deploys the more robust model.
Figure 5: Average clean trajectory compared to average attack trajectories in CIFAR-10, single-attack setting, 50 epochs. Top row, left to right: 4-pixel patch, single-pixel patch, random blended, fixed pattern blended. Bottom row, left to right: sinusoid blended, Narcissus, frequency attack. All backdoor attacks exhibit distinct learning dynamics from clean examples. However, some are learned faster while others are learned slower, making the inductive bias of previous methods (li2021antibackdoor, khaddaj2023rethinking, hayase2021spectre) inappropriate for defending against general poisoning attacks.
...and 3 more figures
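
The scoring and filtering steps described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each training example has a recorded per-epoch loss trajectory, uses Euclidean distance between trajectories, and rejects a fixed fraction of the highest-scoring examples. The function names, the `reject_fraction` parameter, and the toy trajectories are all hypothetical.

```python
import math


def anomaly_scores(trajectories, probe_indices):
    """Mean Euclidean distance from each loss trajectory to the clean probes."""
    probes = [trajectories[i] for i in probe_indices]
    return [
        sum(math.dist(traj, probe) for probe in probes) / len(probes)
        for traj in trajectories
    ]


def filter_training_set(trajectories, probe_indices, reject_fraction):
    """Split example indices into (kept, rejected) by anomaly score.

    Assumption: a fixed fraction of the farthest examples is rejected.
    """
    scores = anomaly_scores(trajectories, probe_indices)
    n_reject = int(len(scores) * reject_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rejected = set(ranked[:n_reject])
    kept = [i for i in range(len(scores)) if i not in rejected]
    return kept, sorted(rejected)


# Toy per-epoch losses, for illustration only (not data from the paper).
trajectories = [
    [2.3, 1.5, 0.9, 0.5],   # index 0: bona fide clean probe
    [2.2, 1.6, 1.0, 0.6],   # index 1: bona fide clean probe
    [2.4, 1.4, 0.8, 0.5],   # index 2: behaves like the probes
    [2.3, 0.2, 0.1, 0.05],  # index 3: learned much faster than the probes
    [2.3, 2.3, 2.2, 2.1],   # index 4: learned much slower than the probes
    [2.1, 1.5, 0.9, 0.4],   # index 5: behaves like the probes
]

kept, rejected = filter_training_set(trajectories, probe_indices=[0, 1], reject_fraction=0.4)
```

In this toy run the two examples whose losses fall unusually fast or unusually slowly are the ones rejected, which mirrors the observation in the Figure 5 caption that attacks may be learned either faster or slower than clean data. The model would then be retrained on the `kept` indices only.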
