Towards Sample-specific Backdoor Attack with Clean Labels via Attribute Trigger

Mingyan Zhu, Yiming Li, Junfeng Guo, Tao Wei, Shu-Tao Xia, Zhan Qin

TL;DR

The authors argue that the intensity constraint of existing SSBAs arises mostly because their trigger patterns are ‘content-irrelevant’ and therefore act as ‘noise’ for both humans and DNNs. They propose to exploit content-relevant features, a.k.a. (human-relied) attributes, as the trigger patterns to design clean-label SSBAs.

Abstract

Currently, sample-specific backdoor attacks (SSBAs) are the most advanced and malicious methods since they can easily circumvent most of the current backdoor defenses. In this paper, we reveal that SSBAs are not sufficiently stealthy due to their poisoned-label nature, where users can discover anomalies if they check the image-label relationship. In particular, we demonstrate that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class. We reveal that it is primarily due to two reasons, including (1) the 'antagonistic effects' of ground-truth features and (2) the learning difficulty of sample-specific features. Accordingly, trigger-related features of existing SSBAs cannot be effectively learned under the clean-label setting due to their mild trigger intensity required for ensuring stealthiness. We argue that the intensity constraint of existing SSBAs is mostly because their trigger patterns are 'content-irrelevant' and therefore act as 'noises' for both humans and DNNs. Motivated by this understanding, we propose to exploit content-relevant features, a.k.a. (human-relied) attributes, as the trigger patterns to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). Extensive experiments are conducted on benchmark datasets, which verify the effectiveness of our BAAT and its resistance to existing defenses.

Paper Structure (36 sections, 2 theorems, 8 equations, 12 figures, 13 tables)

Key Result

Theorem 1

Suppose the training dataset consists of $N_b$ benign samples $\{(\bm{x}_i, y_i)\}_{i=1}^{N_b}$ and $N_p$ poisoned samples $\{(\bm{x}_j', y_t)\}_{j=1}^{N_p}$, whose images are sampled i.i.d. from a uniform distribution and belong to $K$ classes. Assume that the DNN $f(\cdot;\bm{\theta})$ is a multi [...] where $\hat{\bm{x}}$ and $\widetilde{\bm{x}}$ are poisoned testing samples of sample-agnostic and s [...]

Figures (12)

  • Figure 1: The limitations of existing sample-specific and clean-label backdoor attacks. The first two poisoned samples are generated by sample-specific attacks; users can notice them because of their image-label inconsistency (marked in red). The last two are produced by clean-label attacks; detection algorithms can reveal their trigger patterns (marked by the red boxes) because the triggers are sample-agnostic. This example indicates that adversaries should design sample-specific attacks with clean labels to be truly stealthy, since such attacks can bypass both human inspection and machine detection.
  • Figure 2: The attack success rate (ASR, %) of WaNet, ISSBA, and their sample-agnostic versions on the ImageNet dataset with respect to the poisoning rate (%).
  • Figure 3: The poisoned images generated by WaNet-C and ISSBA-C with different intensities (i.e., strengths for WaNet-C and amplification factor for ISSBA-C) on the ImageNet dataset. As shown in this figure, all poisoned images with relatively large intensities look suspicious under human inspection due to their blurring and ringing artifacts.
  • Figure 4: The ground-truth trigger pattern of the label-consistent attack and the pattern synthesized for it by Neural Cleanse.
  • Figure 5: The main pipeline of our backdoor attack with attribute trigger (BAAT). In general, our BAAT consists of three main stages: attack stage, training stage, and inference stage. In the attack stage, the adversaries generate poisoned samples by randomly selecting some benign samples from the target class (e.g., 'Tom') and reassigning the adversary-specified attribute to a particular value (e.g., changing the hairstyle to 'purple hi-top') using a pre-trained attribute editor. In the training stage, the modified poisoned samples as well as the remaining benign ones are used by the victim to train DNNs. In the inference stage, the adversaries can activate the backdoor implanted in the attacked models by modifying the attribute of given images to the adversary-specified one, leading the model to misclassify them into the target class (e.g., the modified images of 'Tina' and 'Jimmy' are both misclassified as 'Tom' due to the purple hi-top hairstyle).
  • ...and 7 more figures
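
The attack stage in the Figure 5 caption and the attack success rate (ASR) plotted in Figure 2 can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `poison_clean_label` and `attack_success_rate`, the `edit_attribute` callable standing in for the pre-trained attribute editor, and the choice to measure the poisoning rate against the whole training set are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def poison_clean_label(
    dataset: Sequence[Tuple[Image, int]],
    target_class: int,
    poisoning_rate: float,
    edit_attribute: Callable[[Image], Image],  # hypothetical attribute editor
    seed: int = 0,
) -> Tuple[List[Tuple[Image, int]], List[int]]:
    """Edit a random subset of target-class images; every label stays unchanged."""
    rng = random.Random(seed)
    # Clean-label setting: only samples that already belong to the target class are eligible.
    target_idx = [i for i, (_, y) in enumerate(dataset) if y == target_class]
    # Assumption: the poisoning rate is a fraction of the whole training set.
    n_poison = min(len(target_idx), int(round(poisoning_rate * len(dataset))))
    chosen = set(rng.sample(target_idx, n_poison))
    poisoned = [
        (edit_attribute(x), y) if i in chosen else (x, y)
        for i, (x, y) in enumerate(dataset)
    ]
    return poisoned, sorted(chosen)


def attack_success_rate(
    predict: Callable[[Image], int],
    test_set: Sequence[Tuple[Image, int]],
    target_class: int,
    edit_attribute: Callable[[Image], Image],
) -> float:
    """ASR (%): share of edited non-target test images predicted as the target class."""
    non_target = [x for x, y in test_set if y != target_class]
    if not non_target:
        return 0.0
    hits = sum(1 for x in non_target if predict(edit_attribute(x)) == target_class)
    return 100.0 * hits / len(non_target)
```

Because only target-class samples are edited and no label is touched, each poisoned image stays consistent with its label, which is the property that defines the clean-label setting.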

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof