Trainwreck: A damaging adversarial attack on image classifiers

Jan Zahálka

TL;DR

The paper introduces damaging adversarial attacks (DAAs) and proposes Trainwreck, a train-time data-poisoning attack that degrades image classifiers by conflating the distributions of similar classes using class-pair universal perturbations (CPUPs). Trainwreck is designed to be stealthy, customizable via a poison rate parameter, and black-box transferable across architectures, and its potency is comparable to state-of-the-art poisoning methods. The paper formalizes the DAA threat model, defines the cost function that DAAs maximize, and presents a concrete attack pipeline that estimates class distribution divergence with the Jensen-Shannon divergence and constructs CPUPs via targeted PGD. Experiments on CIFAR-10 and CIFAR-100 across multiple architectures demonstrate strong damage and transferability, and the authors identify data redundancy with hashing as a practical defense. The work highlights the growing risk of train-time DAAs and provides open-source code to support future research on detection and defense.
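The divergence step mentioned above can be illustrated with a minimal sketch. It assumes each class is summarized by a discrete histogram (for example, over quantized surrogate-model features); how the paper actually estimates the class distributions is not stated on this page, and the names `js_divergence` and `closest_class` are hypothetical.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Inputs are non-negative histograms; they are normalized here.
    The result lies in [0, 1]: 0 for identical, 1 for disjoint distributions.
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # KL(a || b), skipping zero-probability bins of a (0 * log 0 = 0).
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def closest_class(class_hists):
    """For each class, return the index of the other class with the
    lowest JS divergence (assumption: one histogram per class)."""
    n = len(class_hists)
    d = np.full((n, n), np.inf)  # diagonal stays inf so a class never picks itself
    for i in range(n):
        for j in range(n):
            if i != j:
                d[i, j] = js_divergence(class_hists[i], class_hists[j])
    return d.argmin(axis=1)
```

Pairing each class with its lowest-divergence neighbor gives candidate class pairs whose data an attacker could try to conflate; the histograms themselves are illustrative placeholders.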

Abstract

Adversarial attacks are an important security concern for computer vision (CV). As CV models are becoming increasingly valuable assets in applied practice, disrupting them is emerging as a form of economic sabotage. This paper opens up the exploration of damaging adversarial attacks (DAAs) that seek to damage target CV models. DAAs are formalized by defining the threat model, the cost function DAAs maximize, and setting three requirements for success: potency, stealth, and customizability. As a pioneer DAA, this paper proposes Trainwreck, a train-time attack that conflates the data of similar classes in the training data using stealthy ($\epsilon \leq 8/255$) class-pair universal perturbations obtained from a surrogate model. Trainwreck is a black-box, transferable attack: it requires no knowledge of the target architecture, and a single poisoned dataset degrades the performance of any model trained on it. The experimental evaluation on CIFAR-10 and CIFAR-100 and various model architectures (EfficientNetV2, ResNeXt-101, and a finetuned ViT-L-16) demonstrates Trainwreck's efficiency. Trainwreck achieves similar or better potency compared to the data poisoning state of the art and is fully customizable by the poison rate parameter. Finally, data redundancy with hashing is identified as a reliable defense against Trainwreck or similar DAAs. The code is available at https://github.com/JanZahalka/trainwreck.
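To make the stealth bound and the poison rate concrete, the following sketch applies precomputed per-class perturbations to a fraction of each class's images while enforcing $\epsilon \leq 8/255$ in the $\ell_\infty$ sense. It is not the authors' implementation: the function name `apply_perturbations`, the random choice of which images to poison, and the interpretation of the poison rate as a per-class fraction are all assumptions made for illustration.

```python
import numpy as np

EPSILON = 8 / 255  # stealth bound from the abstract, for pixel values in [0, 1]


def apply_perturbations(images, labels, perturbations, poison_rate, rng):
    """Return a poisoned copy of `images`.

    images:        float array in [0, 1], shape (N, H, W, C)
    labels:        int array, shape (N,)
    perturbations: dict mapping class index -> perturbation of shape (H, W, C)
    poison_rate:   fraction of each class's images to perturb (assumption)
    rng:           numpy random Generator used to pick the poisoned images
    """
    poisoned = images.copy()
    for cls, delta in perturbations.items():
        idx = np.flatnonzero(labels == cls)
        n_poison = int(round(poison_rate * len(idx)))
        chosen = rng.choice(idx, size=n_poison, replace=False)
        # Project the perturbation onto the l-infinity ball of radius EPSILON.
        delta = np.clip(delta, -EPSILON, EPSILON)
        # Add it and keep the result a valid image.
        poisoned[chosen] = np.clip(images[chosen] + delta, 0.0, 1.0)
    return poisoned
```

With `poison_rate=1.0` every image of a perturbed class is modified, and lower values leave a matching share of the data clean, which is the sense in which the attack strength can be tuned.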

Paper Structure (15 sections, 6 equations, 3 figures, 2 tables, 2 algorithms)

Introduction
Related work
Damaging adversarial attacks
The Trainwreck attack
Stealth
Estimating class distribution divergence
Class-pair universal perturbations
The full attack
Experimental setup
Baselines
Datasets and parameters
Evaluation protocol
Experimental results
Discussion & defense
Conclusion

Figures (3)

Figure 1: The Trainwreck damaging adversarial attack. The depicted images are actual poisoned CIFAR-10 images successfully damaging classifiers trained on the data.
Figure 2: Examples of two pairs of clean and perturbed images ($\epsilon \leq \frac{8}{255}$).
Figure 3: Test top-1 accuracy results for varying poison rate $\pi$.