Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors

Raz Lapid; Almog Dubin; Moshe Sipper

TL;DR

RADAR adversarially trains the detection system, rather than the classifier, so that it withstands adaptive attacks. The method combines standard classifier objectives with detector objectives and uses selective and orthogonal PGD variants to stabilize training. Empirical results on CIFAR-10, SVHN, and a subset of ImageNet show that RADAR markedly improves detector robustness against adaptive attacks (high ROC-AUC, low SR@5) while preserving clean accuracy and generalizing to unseen classifiers. The work suggests that robust adversarial detectors can be a practical, high-impact element of secure ML systems.
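The "orthogonal" PGD variant mentioned above is usually described as stepping along one objective's gradient after removing the component that lies along the other objective's gradient, so that progress against the classifier does not undo progress against the detector (and vice versa). The following is a minimal sketch of that projection step only, under that assumption; the function names are hypothetical and the exact variant used in the paper is not reproduced here.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_component(g_main, g_other):
    """Return g_main with its projection onto g_other removed.

    Assumed form of the orthogonal step: g_main - (<g_main, g_other> / ||g_other||^2) * g_other.
    If g_other is the zero vector there is nothing to project out.
    """
    norm_sq = dot(g_other, g_other)
    if norm_sq == 0.0:
        return list(g_main)
    scale = dot(g_main, g_other) / norm_sq
    return [m - scale * o for m, o in zip(g_main, g_other)]


# Toy gradients (illustrative values, not taken from the paper).
g_classifier = [3.0, 1.0]
g_detector = [1.0, 0.0]
step_direction = orthogonal_component(g_classifier, g_detector)
print(step_direction)  # [0.0, 1.0]
```

The resulting direction has zero inner product with the detector gradient, which is the property the orthogonal step relies on.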

Abstract

This paper presents RADAR (Robust Adversarial Detection via Adversarial Retraining), an approach designed to enhance the robustness of adversarial detectors against adaptive attacks while maintaining classifier performance. An adaptive attack is one in which the attacker is aware of the defenses and adapts their strategy accordingly. Our proposed method leverages adversarial training to reinforce the ability to detect attacks without compromising clean accuracy. During the training phase, we add to the dataset adversarial examples that were optimized to fool both the classifier and the adversarial detector, enabling the detector to learn and adapt to potential attack scenarios. Experimental evaluations on the CIFAR-10 and SVHN datasets demonstrate that our proposed algorithm significantly improves a detector's ability to accurately identify adaptive adversarial attacks, without sacrificing clean accuracy.
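To make the notion of an adaptive attack concrete, here is a minimal sketch of an L-infinity PGD attack that lowers a joint attacker loss: binary cross-entropy pushing a classifier toward the wrong label, plus binary cross-entropy pushing a detector toward "benign". Everything in it is an assumption for illustration: the classifier and detector are toy logistic models with made-up weights, the two losses are simply summed (this is not the paper's selective or orthogonal variant), and all names are hypothetical. In adversarial retraining, examples produced this way would be added to the detector's training set with the label "adversarial".

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bce(p, target, tiny=1e-12):
    """Binary cross-entropy of probability p against a 0/1 target."""
    return -(target * math.log(p + tiny) + (1 - target) * math.log(1 - p + tiny))


def attacker_loss(x, y, clf, det):
    """Loss the attacker minimizes: wrong class label AND detector says benign (0)."""
    (w, b), (v, c) = clf, det
    return bce(sigmoid(dot(w, x) + b), 1 - y) + bce(sigmoid(dot(v, x) + c), 0)


def adaptive_pgd(x0, y, clf, det, eps=16 / 255, alpha=2 / 255, steps=20):
    """Signed-gradient descent on attacker_loss inside an eps-ball around x0."""
    (w, b), (v, c) = clf, det
    x = list(x0)
    for _ in range(steps):
        p_clf = sigmoid(dot(w, x) + b)
        p_det = sigmoid(dot(v, x) + c)
        # For a logistic model, d BCE(sigmoid(z), t) / dx = (sigmoid(z) - t) * weights.
        grad = [(p_clf - (1 - y)) * wi + (p_det - 0) * vi for wi, vi in zip(w, v)]
        for i, g in enumerate(grad):
            step = alpha * ((g > 0) - (g < 0))            # alpha * sign(g)
            xi = x[i] - step                              # descend the attacker loss
            xi = min(max(xi, x0[i] - eps), x0[i] + eps)   # project onto the eps-ball
            x[i] = min(max(xi, 0.0), 1.0)                 # keep a valid pixel range
    return x


# Toy models and input (illustrative values only).
classifier = ([4.0, -3.0, 2.0], -1.0)   # (weights, bias)
detector = ([-2.0, 5.0, 1.0], -2.0)     # output 1 = "adversarial", 0 = "benign"
x_clean, label = [0.6, 0.4, 0.5], 1

x_adv = adaptive_pgd(x_clean, label, classifier, detector)
print(attacker_loss(x_clean, label, classifier, detector))
print(attacker_loss(x_adv, label, classifier, detector))
```

The second printed loss is lower than the first, while the perturbation stays within the `eps` budget; whether the toy classifier's decision actually flips depends on the chosen weights and budget.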

Paper Structure (9 sections, 15 equations, 8 figures, 7 tables, 1 algorithm)

Introduction
Previous Work
Methodology
Threat Model
Problem Definition
Experimental Framework
Results
Ablation Studies
Conclusions

Figures (8)

Figure 1: General scheme of adversarial attacks. $x$: original image. $x'_{\texttt{adv}}$: standard adversarial attack. $x''_{\texttt{adv}}$: adaptive adversarial attack, targeting both $f_\theta$ (classifier) and $g_\phi$ (detector). The attacker's goal is to fool the classifier into misclassifying the image and, at the same time, fool the detector into reporting the attacked image as benign (i.e., failing to recognize the attack).
Figure 2: Generalization performance of adversarially trained detectors, trained on CIFAR-10, SVHN, and ImageNet. Each adversarial detector was trained with its corresponding classifier, e.g., the ResNet-50 adversarial detector was trained with the ResNet-50 image classifier. The figure shows how well each detector generalizes to classifiers it was not trained with. Each value is the ROC-AUC of the respective detector/classifier pair, for OPGD (top row) and SPGD (bottom row) with $\epsilon=\frac{16}{255}$.
Figure 3: CIFAR-10. Binary cross-entropy loss, from the attacker's point of view, when crafting an adversarial instance from the test set. The plots show the progression of the loss under orthogonal projected gradient descent (OPGD) for 20 different images; the attacker's goal is to minimize the loss. Top: Prior to adversarial training, the loss converges to zero after a small number of iterations. Bottom: After adversarial training, the losses are orders of magnitude higher than for the standard detector (note the difference in scales). This shows that the detector is now resilient, i.e., far harder to fool.
Figure 4: SVHN. Binary cross-entropy loss, from the attacker's point of view, when crafting an adversarial instance from the test set.
Figure 5: ImageNet. Binary cross-entropy loss, from the attacker's point of view, when crafting an adversarial instance from the test set using OPGD.
...and 3 more figures
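
The ROC-AUC values reported in Figure 2 summarize how well a detector's scores separate adversarial inputs from benign ones. As a rough illustration, the sketch below computes ROC-AUC from its pairwise definition: the fraction of (adversarial, benign) pairs in which the adversarial input receives the higher score, with ties counted as one half. The function name and the scores are hypothetical, and the paper's own evaluation code may differ.

```python
def roc_auc(benign_scores, adversarial_scores):
    """ROC-AUC via pairwise comparison (higher score = more likely adversarial)."""
    wins = 0.0
    for a in adversarial_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0      # adversarial example correctly ranked above benign
            elif a == b:
                wins += 0.5      # ties count as half
    return wins / (len(adversarial_scores) * len(benign_scores))


# Hypothetical detector scores, for illustration only.
benign = [0.1, 0.4]
adversarial = [0.35, 0.8]
print(roc_auc(benign, adversarial))  # 0.75
```

A value of 1.0 means every adversarial input scores above every benign one; 0.5 corresponds to scores that carry no ranking information.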
