Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors
Raz Lapid, Almog Dubin, Moshe Sipper
TL;DR
RADAR adversarially trains the detector, rather than the classifier, so that detection withstands adaptive attacks. The method combines standard classifier objectives with detector objectives and uses selective and orthogonal PGD variants to stabilize training. Empirical results on CIFAR-10, SVHN, and a subset of ImageNet show that RADAR markedly improves detector robustness against adaptive attacks (high ROC-AUC, low SR@5) while preserving clean accuracy and generalizing to unseen classifiers. These results suggest that robust adversarial detectors can be a practical component of secure ML systems.
Abstract
This paper presents RADAR (Robust Adversarial Detection via Adversarial Retraining), an approach designed to enhance the robustness of adversarial detectors against adaptive attacks while maintaining classifier performance. An adaptive attack is one in which the attacker is aware of the defenses and adapts their strategy accordingly. Our method leverages adversarial training to reinforce the ability to detect attacks without compromising clean accuracy. During training, we add to the dataset adversarial examples that were optimized to fool both the classifier and the adversarial detector, enabling the detector to learn and adapt to potential attack scenarios. Experimental evaluations on the CIFAR-10 and SVHN datasets demonstrate that our algorithm significantly improves a detector's ability to accurately identify adaptive adversarial attacks, without sacrificing clean accuracy.
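The training idea described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes linear (logistic) models in NumPy, keeps the classifier frozen, and replaces the selective and orthogonal PGD variants with a plain L-infinity PGD that ascends the sum of the classifier loss and the detector loss. The names `joint_pgd` and `train_detector` and all hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_pgd(x, y, w_clf, w_det, eps=0.5, alpha=0.1, steps=20):
    """Adaptive L-inf PGD (toy): raise the classifier's loss on the true
    label while pushing the detector's score toward 'benign'."""
    x_adv = x.copy()
    for _ in range(steps):
        p_clf = sigmoid(x_adv @ w_clf)  # classifier P(y = 1)
        p_det = sigmoid(x_adv @ w_det)  # detector P(input is adversarial)
        # Input-gradient of the classifier cross-entropy for label y.
        g_clf = (p_clf - y)[:, None] * w_clf
        # Input-gradient of the detector cross-entropy for label
        # "adversarial" (1); ascending it makes the detector say "benign".
        g_det = (p_det - 1.0)[:, None] * w_det
        x_adv = x_adv + alpha * np.sign(g_clf + g_det)
        # Project back into the eps-ball around the clean input.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

def train_detector(x, y, w_clf, epochs=50, lr=0.1, seed=0):
    """Adversarially retrain only the detector; the classifier is frozen
    (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    w_det = rng.normal(scale=0.01, size=x.shape[1])
    for _ in range(epochs):
        # Regenerate adaptive examples against the current detector.
        x_adv = joint_pgd(x, y, w_clf, w_det)
        feats = np.vstack([x, x_adv])
        labels = np.concatenate([np.zeros(len(x)), np.ones(len(x_adv))])
        p = sigmoid(feats @ w_det)
        # One gradient-descent step on the detector's cross-entropy.
        w_det -= lr * feats.T @ (p - labels) / len(labels)
    return w_det

# Toy usage: synthetic data and a fixed random linear classifier.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
w_clf = rng.normal(size=8)
y = (x @ w_clf > 0).astype(float)
w_det = train_detector(x, y, w_clf)
x_adv = joint_pgd(x, y, w_clf, w_det)
```

Because only the detector's weights are updated in this sketch, the classifier's clean accuracy is unchanged by construction. Whether the detector actually becomes robust depends on the model class and the attack, and the toy setup makes no claim about that.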
