Deep learning models are vulnerable, but adversarial examples are even more vulnerable
Jun Li, Yanwei Xu, Keran Li, Xiaoli Zhang
TL;DR
This work reveals that adversarial examples are intrinsically more sensitive to occlusion than clean inputs. It defines Sliding Mask Confidence Entropy ($H_{\text{SMCE}}$) to quantify confidence volatility under a sliding $m \times m$ occlusion window, and visualizes this instability with Mask Entropy Field Maps. Building on this, the Sliding Window Mask-based Adversarial Example Detection (SWM-AED) detector uses SMCE thresholds to separate adversarial from clean samples without requiring adversarial training, and so avoids the catastrophic overfitting associated with it. Empirical results on CIFAR-10 across multiple architectures and nine attack methods show robust detection (often >80% accuracy) and highlight the benefits of model depth and an appropriate mask size for improving performance. The approach provides a scalable, model-aware defense that exploits an intrinsic vulnerability of adversarial examples rather than relying solely on prior robustness training, with public code and data for reproducibility.
Abstract
Understanding the intrinsic differences between adversarial examples and clean samples is key to improving DNN robustness and the detection of adversarial attacks. This study first empirically finds that image-based adversarial examples are notably sensitive to occlusion. In controlled experiments on CIFAR-10, nine canonical attacks (e.g., FGSM, PGD) were used to generate adversarial examples, which were paired with the original samples for evaluation. We introduce Sliding Mask Confidence Entropy (SMCE) to quantify the fluctuation of model confidence under occlusion. On more than 1,800 test images, SMCE calculations, supported by Mask Entropy Field Maps and statistical distributions, show that adversarial examples have significantly higher confidence volatility under occlusion than the originals. Based on this, we propose Sliding Window Mask-based Adversarial Example Detection (SWM-AED), which avoids the catastrophic overfitting of conventional adversarial training. Evaluations across classifiers and attacks on CIFAR-10 demonstrate robust performance, with accuracy over 62% in most cases and up to 96.5%.
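The idea behind SMCE and threshold-based detection can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the sketch assumes that SMCE is the mean Shannon entropy of the classifier's softmax output over all positions of an $m \times m$ mask, and that a sample is flagged when this value exceeds a threshold. The names `sliding_mask_entropy_field`, `smce`, `is_adversarial`, and `predict_proba`, as well as the `stride` and `fill` parameters, are hypothetical and do not come from the authors' released code.

```python
import numpy as np

def sliding_mask_entropy_field(image, predict_proba, m=4, stride=1, fill=0.0):
    """Entropy of the class probabilities at each mask position.

    image:         array of shape (H, W) or (H, W, C)
    predict_proba: hypothetical callable mapping one image to a 1-D
                   probability vector (e.g. a softmax output)
    Assumption: the per-position value is the Shannon entropy of the
    full probability vector; the paper's definition may differ.
    """
    h, w = image.shape[:2]
    rows = range(0, h - m + 1, stride)
    cols = range(0, w - m + 1, stride)
    field = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            masked = image.copy()
            masked[r:r + m, c:c + m, ...] = fill  # occlude an m x m window
            p = np.clip(predict_proba(masked), 1e-12, 1.0)  # avoid log(0)
            field[i, j] = -np.sum(p * np.log(p))
    return field

def smce(image, predict_proba, m=4, stride=1):
    """Assumed SMCE: mean of the entropy field over all mask positions."""
    return float(sliding_mask_entropy_field(image, predict_proba, m, stride).mean())

def is_adversarial(image, predict_proba, threshold, m=4, stride=1):
    """Flag a sample whose assumed SMCE exceeds a chosen threshold."""
    return smce(image, predict_proba, m, stride) > threshold
```

In this sketch the array returned by `sliding_mask_entropy_field` plays the role of a Mask Entropy Field Map, and `threshold` would have to be chosen empirically, for example from the SMCE distribution of clean samples for a given classifier and mask size.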
