Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks

Quang H. Nguyen, Nguyen Ngoc-Hieu, The-Anh Ta, Thanh Nguyen-Tang, Kok-Seng Wong, Hoang Thanh-Tung, Khoa D. Doan

TL;DR

The paper addresses backdoor risks under a highly constrained setting where the attacker only supplies data for the target class and has no access to other classes or the victim model. It introduces two data-selection strategies—one using pretrained feature spaces and one leveraging out-of-distribution data—to identify hard samples within the target class for poisoning, enabling effective clean-label backdoors with small budgets. Through extensive experiments on CIFAR-10, GTSRB, and face datasets, the proposed methods achieve higher attack success rates while preserving benign accuracy and show resilience against several defenses. The work highlights a practical, under-explored threat in decentralized data pipelines and motivates the community to develop countermeasures tailored to single-class data-provider attacks.

Abstract

Deep neural networks are vulnerable to backdoor attacks, a type of adversarial attack that poisons the training data to manipulate the behavior of models trained on such data. Clean-label attacks are a stealthier form of backdoor attack that succeeds without changing the labels of the poisoned data. Early work on clean-label attacks added triggers to a random subset of the training set, ignoring the fact that samples contribute unequally to the attack's success. This results in high poisoning rates and low attack success rates. To alleviate the problem, several supervised learning-based sample selection strategies have been proposed. However, these methods assume access to the entire labeled training set and require training, which is expensive and may not always be practical. This work studies a new and more practical (but also more challenging) threat model where the attacker only provides data for the target class (e.g., in face recognition systems) and has no knowledge of the victim model or any other classes in the training set. We study different strategies for selectively poisoning a small set of training samples in the target class to boost the attack success rate in this setting. This threat model poses a serious risk when training machine learning models on third-party datasets, since the attack can be performed effectively with limited information. Experiments on benchmark datasets illustrate the effectiveness of our strategies in improving clean-label backdoor attacks.
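
The selection idea described above (poison only the hardest samples of the target class, using nothing but target-class data) can be illustrated with a minimal sketch. This is not the authors' scoring rule: the hardness score used here, the distance of each sample to the class centroid in a pretrained feature space, is an assumption made for illustration, and `select_hard_samples` and `poison_rate` are hypothetical names. The features are assumed to come from any pretrained extractor.

```python
import numpy as np


def select_hard_samples(features: np.ndarray, poison_rate: float) -> np.ndarray:
    """Return indices of the target-class samples chosen for poisoning.

    ASSUMPTION: "hardness" is approximated by the distance to the class
    centroid in a pretrained feature space. The paper's actual score may
    differ; this only illustrates selective (non-random) poisoning.

    features:    (n_samples, dim) array of target-class features.
    poison_rate: fraction of the target class to poison, in (0, 1].
    """
    if not 0.0 < poison_rate <= 1.0:
        raise ValueError("poison_rate must be in (0, 1]")

    # L2-normalize so the score does not depend on feature magnitude.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized = features / np.clip(norms, 1e-12, None)

    # Samples far from the class centroid are treated as the "odd" ones.
    centroid = normalized.mean(axis=0)
    scores = np.linalg.norm(normalized - centroid, axis=1)

    # Poison budget: at least one sample, otherwise a fraction of the class.
    budget = max(1, int(round(poison_rate * len(features))))

    # Sort by descending score and keep the hardest samples.
    return np.argsort(-scores)[:budget]


# Usage with synthetic stand-in features (5000 samples, as in one CIFAR10 class).
rng = np.random.default_rng(0)
target_features = rng.normal(size=(5000, 128))
poison_indices = select_hard_samples(target_features, poison_rate=0.10)
```

The trigger (for example SIG) would then be applied only to the images at `poison_indices`, with their labels left unchanged. Replacing the random subset with a scored subset is the only change relative to a standard clean-label attack in this sketch.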

Paper Structure

This paper contains 28 sections, 3 equations, 7 figures, 14 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Illustration of our threat model. The attacker acts as a data provider in a supply chain where each data provider is responsible for a data class. The attacker injects a trigger into the images without changing the label and sends them to the victim. The model that is trained on this poisoned dataset behaves normally on clean images but returns the target label when the trigger is added to any image.
  • Figure 2: Attack success rate of SIG on ResNet18/CIFAR10 when the poisoned samples ($10\%$ of the target class) are harder than the $0$th, $30$th, $60$th, and $90$th percentile. The horizontal line marks the attack success rate when the poisoned set is selected randomly.
  • Figure 3: Feature spaces of CIFAR10 (left) and GTSRB (right), visualized with t-SNE using VICReg as the feature extractor. Data points with the same color have the same label. The pretrained model separates the training set into clusters that correspond to the labels.
  • Figure 4: EL2N scores and our scores for training samples in class $0$ of CIFAR10. The thresholds at which $5\%$, $10\%$, and $20\%$ of class $0$ ($0.5\%$, $1\%$, and $2\%$ of the training data) are poisoned are also shown.
  • Figure 5: Performance against STRIP.
  • ...and 2 more figures