Universal Backdoor Attacks

Benjamin Schneider; Nils Lukas; Florian Kerschbaum

TL;DR

This paper demonstrates Universal Backdoor Attacks, which enable misclassification from any source class into any target class while poisoning a tiny fraction of the training data ($0.15\%$) and scale to thousands of classes. It introduces a pipeline that leverages a surrogate model's latent space and inter-class poison transferability, encoding class-specific triggers via a compact $n$-bit representation rendered in one of two trigger styles (patch or blend). The approach achieves high attack success rates on ImageNet-1K and scales to ImageNet-2K/4K/6K while remaining robust against several defenses and exhibiting strong inter-class transfer effects; neutralizing such backdoors imposes a substantial data-cleaning burden on defenders. The work highlights the need for dataset-wide defenses when training on web-scraped data, and the code is publicly available for replication.
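Any scheme that gives each of $C$ classes a distinct $n$-bit code needs at least $n = \lceil \log_2 C \rceil$ bits. The sketch below computes this lower bound for the class counts mentioned above. It is only the information-theoretic minimum; the $n$ the paper actually uses may be larger, and the helper name `min_code_bits` is hypothetical.

```python
def min_code_bits(num_classes: int) -> int:
    """Smallest n with 2**n >= num_classes, i.e. ceil(log2(num_classes)).

    Uses integer bit_length() to avoid floating-point rounding in log2.
    """
    if num_classes < 1:
        raise ValueError("num_classes must be positive")
    return max(1, (num_classes - 1).bit_length())


for c in (1000, 2000, 4000, 6000):
    print(f"{c} classes -> at least {min_code_bits(c)} bits")
```

This gives 10, 11, 12, and 13 bits for 1,000, 2,000, 4,000, and 6,000 classes: doubling the number of classes adds only one bit to the minimum code length.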

Abstract

Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and reused many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naive composition of attacks vastly increases the number of poison samples. We show that this is not necessarily true: more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with only a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset. Our source code is available at https://github.com/Ben-Schneider-code/Universal-Backdoor-Attacks.
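To make the 0.15% budget concrete, the following arithmetic applies it to the standard ImageNet-1K training split (1,281,167 images). This is an illustration under that assumption, not the paper's reported poison count, and `poison_budget` is a hypothetical helper.

```python
def poison_budget(num_train: int, poison_rate: float, num_classes: int):
    """Return (total poison samples, average poison samples per target class)."""
    total = round(num_train * poison_rate)
    return total, total / num_classes


# Assumption: ImageNet-1K training split size and the 0.15% rate from the abstract.
total, per_class = poison_budget(num_train=1_281_167, poison_rate=0.0015, num_classes=1000)
print(total, per_class)  # 1922 poison images, about 1.9 per class
```

With fewer than two poison samples per class on average, a naive per-class composition of backdoors would have very little signal for each target, which suggests why the attack depends on inter-class poison transferability.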

Paper Structure

This paper contains 21 sections, 2 equations, 6 figures, 7 tables, and 1 algorithm.

Introduction
Background
Our Method
Threat Model
Inter-class Poison Transferability
Creating Triggers
Encoding Approach
Experiments
Experimental Setup
Effectiveness on ImageNet-1K
Scaling
Measuring Inter-Class Poison Transferability
Robustness Against Defenses
Measuring the Clean Data Trade-off
Discussion and Related Work
...and 6 more sections

Figures (6)

Figure 1: An overview of a universal poisoning attack pipeline. The CLIP encoder maps images and labels into the same latent space. We find principal components in this latent space using LDA and encode regions of the latent space with separate triggers. During inference, we obtain the latent for a target label via CLIP, project it onto the principal components, and generate the trigger corresponding to this point, which we apply to the image. Our universal backdoor is agnostic to the trigger pattern used to encode latents; we showcase a simple binary encoding via QR-code patterns.
Figure 2: Two exemplary methods of encoding latent directions. (Left) Universal Backdoor with a patch trigger encoding. (Right) Universal Backdoor with a blended trigger encoding.
Figure 3: Our attack versus a baseline using patch encoding triggers. We measure the attack success rate and use early stopping at 70 epochs.
Figure 4: Attack success rate on a subset of observed target classes while increasing poisoning in other classes in the dataset.
Figure 5: Clean data as a percentage of the training dataset size required to remove our Universal Backdoor.
...and 1 more figure
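The latent-encoding step in Figure 1 can be sketched with NumPy. This is a minimal illustration under stated assumptions: random vectors stand in for CLIP image latents, the LDA directions (which the caption calls principal components) come from a textbook Fisher LDA, and each bit is set by thresholding a class's projected mean at the median across classes. The thresholding rule and the names `lda_directions` and `class_codes` are hypothetical, not the authors' implementation.

```python
import numpy as np


def lda_directions(latents, labels, n_components, reg=1e-3):
    """Fisher LDA: top directions maximizing between- vs within-class scatter."""
    d = latents.shape[1]
    overall_mean = latents.mean(axis=0)
    s_w = np.zeros((d, d))
    s_b = np.zeros((d, d))
    for c in np.unique(labels):
        x_c = latents[labels == c]
        mu_c = x_c.mean(axis=0)
        s_w += (x_c - mu_c).T @ (x_c - mu_c)
        diff = (mu_c - overall_mean)[:, None]
        s_b += len(x_c) * (diff @ diff.T)
    s_w += reg * np.eye(d)  # ridge term keeps S_w positive definite
    # Whiten with S_w^{-1/2} so S_b w = lambda S_w w becomes a symmetric problem.
    w_vals, w_vecs = np.linalg.eigh(s_w)
    s_w_inv_sqrt = w_vecs @ np.diag(1.0 / np.sqrt(w_vals)) @ w_vecs.T
    vals, vecs = np.linalg.eigh(s_w_inv_sqrt @ s_b @ s_w_inv_sqrt)
    order = np.argsort(vals)[::-1][:n_components]
    return s_w_inv_sqrt @ vecs[:, order]  # shape (d, n_components)


def class_codes(latents, labels, n_bits):
    """Hypothetical encoding: one n-bit code per class from LDA projections."""
    classes = np.unique(labels)
    directions = lda_directions(latents, labels, n_bits)
    means = np.stack([latents[labels == c].mean(axis=0) for c in classes])
    proj = means @ directions  # (num_classes, n_bits)
    # Assumed rule: bit i is 1 if the class mean projects above the median.
    bits = (proj > np.median(proj, axis=0)).astype(np.uint8)
    return classes, bits


# Synthetic stand-in for CLIP latents: 16 classes, 20 samples each, 32 dims.
rng = np.random.default_rng(0)
num_classes, per_class, dim, n_bits = 16, 20, 32, 8
centers = rng.normal(size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
latents = centers[labels] + 0.3 * rng.normal(size=(len(labels), dim))
classes, bits = class_codes(latents, labels, n_bits)
print(bits[:3])  # codes for the first three classes
```

Because the median threshold splits the classes evenly along each direction, each bit here is 1 for exactly half of the 16 classes. Two classes can still receive the same code, so a practical encoding needs enough bits to keep codes distinct. Figure 1 projects the CLIP latent of the target label, whereas this sketch uses class means of image latents for simplicity.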
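Figure 2 contrasts patch and blended encodings. The sketch below shows one way to render an $n$-bit code in each style, assuming images are float arrays in $[0, 1]$ with shape (H, W, 3). The QR-like grid layout, cell size, corner placement, random ±1 basis patterns, and blend weight `alpha` are illustrative assumptions, and all function names are hypothetical.

```python
import numpy as np


def patch_trigger(bits, cell=4):
    """Lay bits out on a square grid (QR-like), each bit a cell x cell block."""
    side = int(np.ceil(np.sqrt(len(bits))))
    grid = np.zeros(side * side, dtype=np.float32)
    grid[: len(bits)] = bits  # unused cells stay black
    return np.kron(grid.reshape(side, side), np.ones((cell, cell), dtype=np.float32))


def apply_patch(image, patch, top=0, left=0):
    """Paste a grayscale patch onto all three channels of an image copy."""
    out = image.copy()
    h, w = patch.shape
    out[top : top + h, left : left + w, :] = patch[:, :, None]
    return out


def blend_pattern(bits, shape, seed=0):
    """Full-image pattern in [0, 1]: signed sum of fixed random +/-1 bases."""
    rng = np.random.default_rng(seed)  # the same seed must be reused for every image
    bases = rng.choice([-1.0, 1.0], size=(len(bits),) + tuple(shape))
    signs = 2.0 * np.asarray(bits, dtype=np.float64) - 1.0  # 0/1 -> -1/+1
    summed = np.tensordot(signs, bases, axes=1)  # values in [-n, n]
    return 0.5 + summed / (2 * len(bits))


def apply_blend(image, pattern, alpha=0.1):
    """Alpha-blend the pattern over the whole image."""
    return (1.0 - alpha) * image + alpha * pattern


image = np.full((32, 32, 3), 0.5)
bits = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
patched = apply_patch(image, patch_trigger(bits))
blended = apply_blend(image, blend_pattern(bits, image.shape))
```

The patch localizes the code in one corner, whereas the blend spreads it over every pixel at low amplitude; Figure 1 notes that the backdoor itself is agnostic to which pattern encodes the latents.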
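Figures 3 and 4 report attack success rate (ASR). A common definition, assumed here rather than taken from the paper, is the fraction of triggered test inputs whose true class differs from the target and which the model assigns to the target class. The helper `attack_success_rate` is hypothetical.

```python
import numpy as np


def attack_success_rate(predictions, true_labels, target_labels):
    """Share of triggered inputs (excluding those already in the target class)
    that the model classifies as the attacker's target class."""
    predictions = np.asarray(predictions)
    true_labels = np.asarray(true_labels)
    target_labels = np.asarray(target_labels)
    mask = true_labels != target_labels  # skip inputs already in the target class
    if not mask.any():
        raise ValueError("no inputs with a source class different from the target")
    return float(np.mean(predictions[mask] == target_labels[mask]))


preds = [3, 3, 7, 3, 1]
truth = [0, 1, 2, 3, 4]
targets = [3, 3, 3, 3, 1]
print(attack_success_rate(preds, truth, targets))  # 0.75
```

For a universal backdoor the target can differ per input, so the target is passed per sample rather than as a single class; the fourth input is excluded because its true class already equals its target.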
