Clean-image Backdoor Attacks

Dazhong Rong; Guoyao Yu; Shuheng Shen; Xinyi Fu; Peng Qian; Jianhai Chen; Qinming He; Xing Fu; Weiqiang Wang

TL;DR

This work exposes a vulnerability in outsourced image labeling: backdoors can be implanted without modifying any training image, by poisoning a fraction $\beta$ of the labels and exploiting a kernel-based trigger feature. The attacker partitions the data using a simple, learnable score $g$ and a binary feature $f$, and relabels the images that have the feature to a backdoor class $y^T$. The backdoor can then be activated either by natural inputs that already have the feature or by tiny perturbations within a bound $\lambda$. The authors formalize the problem, introduce two trigger-construction strategies (randomizing and learning), and report strong effectiveness and stealth on MNIST and CIFAR-10 across diverse architectures. The results highlight practical risks to fairness and robustness in outsourced labeling pipelines and motivate further work on defenses against clean-image backdoor attacks.

Abstract

To gather large quantities of annotated training data for high-performance image classification models, many companies enlist third-party providers to label their unlabeled data. This practice is widely regarded as secure even when some annotation errors occur, for two reasons: the impact of such minor inaccuracies on final model performance is negligible, and existing backdoor attacks require the attacker to be able to poison the training images. Nevertheless, in this paper we propose clean-image backdoor attacks, which show that backdoors can still be injected through a fraction of incorrect labels without modifying the training images. Specifically, the attacker first seeks a trigger feature that divides the training images into two parts: those with the feature and those without it. The attacker then falsifies the labels of the former part to a backdoor class. The backdoor is implanted into the target model once it is trained on the poisoned data. During the inference phase, the attacker can activate the backdoor in two ways: by slightly modifying the input image so that it acquires the trigger feature, or by supplying an image that naturally has the trigger feature. We conduct extensive experiments to demonstrate the effectiveness and practicality of our attacks. Based on the experimental results, we conclude that our attacks seriously jeopardize the fairness and robustness of image classification models, and that vigilance about incorrect labels in outsourced labeling is necessary.
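
The label-falsification step and the perturbation-based activation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the excerpt does not define $g$ or $f$, so the sketch assumes $g$ is the inner product of an image with a fixed random kernel and $f$ is a threshold on $g$ chosen so that roughly a fraction $\beta$ of images have the feature. The function names, the synthetic data, and the sign-based perturbation are all hypothetical.

```python
import numpy as np

def kernel_score(images, kernel):
    """Assumed score g: inner product of each flattened image with a fixed kernel."""
    return images.reshape(len(images), -1) @ kernel.ravel()

def poison_labels(images, labels, kernel, beta, target_class):
    """Relabel the images that have the trigger feature; the images are never modified."""
    scores = kernel_score(images, kernel)
    # Assumed binary feature f: score above the (1 - beta) quantile.
    threshold = np.quantile(scores, 1.0 - beta)
    has_feature = scores > threshold
    poisoned = labels.copy()
    poisoned[has_feature] = target_class  # falsify labels to the backdoor class y^T
    return poisoned, has_feature, threshold

def perturb_toward_feature(image, kernel, lam):
    """Raise the score with a perturbation of L-infinity norm at most lam.

    Whether the perturbed image actually crosses the threshold depends on
    lam and on the image's original score.
    """
    return np.clip(image + lam * np.sign(kernel), 0.0, 1.0)

# Synthetic stand-in data (hypothetical): 1000 grayscale 8x8 images in [0, 1].
rng = np.random.default_rng(0)
images = rng.random((1000, 8, 8))
labels = rng.integers(0, 10, size=1000)
kernel = rng.standard_normal((8, 8))

original_images = images.copy()
poisoned, has_feature, threshold = poison_labels(
    images, labels, kernel, beta=0.05, target_class=0
)

lam = 0.05
perturbed = perturb_toward_feature(images[0], kernel, lam)
old_score = kernel_score(images[:1], kernel)[0]
new_score = kernel_score(perturbed[None], kernel)[0]
```

Only `poisoned` differs from the clean dataset; `images` is left untouched, which is the property that separates this setting from attacks that stamp a trigger onto training images.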

Paper Structure (20 sections, 4 equations, 4 figures)

Introduction
Related Work
Visible Backdoor Attacks
Invisible Backdoor Attacks
Clean-label Backdoor Attacks
Clean-image Backdoor Attacks
Preliminaries
Problem Formulation
Attack Limitations
Clean-image Backdoor Attacks
Core Idea and Challenges
Function Design
Attack Workflows
Experiments
Experimental Setup
...and 5 more sections

Figures (4)

Figure 1: Original clean images (top) and generated poisoned images (bottom)
Figure 2: Performance across different datasets and models
Figure 4: Original class distribution of falsified-label images
Figure 5: ASR under different attack strategies
