Defending Against Frequency-Based Attacks with Diffusion Models
Fatemeh Amerehi, Patrick Healy
TL;DR
The paper addresses the robustness of vision models to unseen, frequency-based adversarial perturbations. It applies diffusion-based adversarial purification: a possibly perturbed input is diffused forward to a small timestep $t^*$, then denoised by solving the reverse-time diffusion (VP-SDE), and the purified result is passed to the classifier. Experiments on ImageNet with ResNet-50, ViT-B-16, and Swin-B show substantial robustness gains against both spectral and spatial attacks at a modest cost in clean accuracy, often surpassing adversarial training in robustness. The work highlights the practical potential of diffusion purification for generalizing to unseen threat models and data shifts. The choice of $t^*$ governs the trade-off between robustness and clean accuracy.
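The purification procedure summarized above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The linear noise schedule and its endpoints (`BETA_MIN`, `BETA_MAX`), the Euler-Maruyama solver, and the helper names are assumptions made for illustration. In particular, `toy_score` is a hypothetical stand-in for a trained score network: it is the exact score only when clean pixels are i.i.d. Gaussian, which is far simpler than a real image distribution.

```python
import math
import random

# Linear VP-SDE noise schedule on t in [0, 1]; endpoint values are assumed defaults.
BETA_MIN, BETA_MAX = 0.1, 20.0


def beta(t):
    """Instantaneous noise rate beta(t)."""
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def alpha_bar(t):
    """exp(-integral_0^t beta(s) ds): share of the signal variance kept at time t."""
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))


def toy_score(x, t, mu=0.5, sigma=0.1):
    """Hypothetical stand-in for a trained score network.

    Exact score of the diffused marginal if clean pixels are i.i.d. N(mu, sigma^2).
    """
    a = alpha_bar(t)
    mean = math.sqrt(a) * mu
    var = a * sigma ** 2 + (1.0 - a)
    return [-(xi - mean) / var for xi in x]


def purify(x_adv, t_star=0.1, n_steps=100, score_fn=toy_score, rng=None):
    """Diffuse x_adv forward to t_star, then integrate the reverse VP-SDE to t = 0."""
    rng = rng or random.Random(0)

    # Forward diffusion in closed form:
    # x(t*) = sqrt(alpha_bar) * x_adv + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I).
    a = alpha_bar(t_star)
    x = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
         for xi in x_adv]

    # Reverse-time VP-SDE, Euler-Maruyama steps from t_star down to 0:
    # x <- x + [0.5 * beta * x + beta * score(x, t)] * dt + sqrt(beta * dt) * z.
    dt = t_star / n_steps
    for i in range(n_steps):
        t = t_star - i * dt
        b = beta(t)
        s = score_fn(x, t)
        x = [xi + (0.5 * b * xi + b * si) * dt
             + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
             for xi, si in zip(x, s)]
    return x
```

Calling `purify(x_adv, t_star=0.1)` on a flattened image (a list of floats) returns a purified list of the same length, which would then be passed to the classifier. Under the toy prior, a perturbation that pushes pixels away from `mu` is pulled back toward it. A larger `t_star` removes more of the perturbation but also more of the original image content, which is the robustness/accuracy trade-off noted above.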
Abstract
Adversarial training is a common strategy for enhancing model robustness against adversarial attacks. However, it is typically tailored to the specific attack types it is trained on, limiting its ability to generalize to unseen threat models. Adversarial purification offers an alternative by leveraging a generative model to remove perturbations before classification. Since the purifier is trained independently of both the classifier and the threat models, it is better equipped to handle previously unseen attack scenarios. Diffusion models have proven highly effective for noise purification, not only in countering pixel-wise adversarial perturbations but also in addressing non-adversarial data shifts. In this study, we broaden the focus beyond pixel-wise robustness to explore the extent to which purification can mitigate both spectral and spatial adversarial attacks. Our findings highlight its effectiveness in handling diverse distortion patterns across low- to high-frequency regions.
