Defending Against Frequency-Based Attacks with Diffusion Models

Fatemeh Amerehi, Patrick Healy

TL;DR

The paper addresses the robustness of vision models to unseen, frequency-based adversarial perturbations. It proposes diffusion-based adversarial purification: an input is diffused forward to a small timestep $t^*$ and then denoised by the reverse diffusion process (VP-SDE), and the purified result is passed to the classifier. Experiments on ImageNet across ResNet-50, ViT-B-16, and Swin-B show substantial robustness gains against both spectral and spatial attacks with only modest clean-accuracy losses, often surpassing adversarial training in robustness. The work highlights the practical potential of diffusion purification for generalizing to unseen threat models and data shifts, with the diffusion timestep $t^*$ controlling the trade-off between robustness and accuracy.
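The diffuse-then-denoise procedure summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear schedule constants `BETA_MIN` and `BETA_MAX`, the Euler-Maruyama integrator, the step count, and the `score_fn` argument (a stand-in for a pretrained score network) are all hypothetical choices that do not come from the text.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear VP-SDE schedule, not from the source


def beta(t):
    """Linear noise schedule beta(t) for t in [0, 1]."""
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t


def alpha(t):
    """alpha(t) = exp(-integral_0^t beta(s) ds), closed form for the linear schedule."""
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2))


def diffuse(x, t_star, rng):
    """Forward step: sample x(t*) = sqrt(alpha) * x + sqrt(1 - alpha) * noise."""
    a = alpha(t_star)
    return np.sqrt(a) * x + np.sqrt(1.0 - a) * rng.standard_normal(x.shape)


def purify(x_adv, score_fn, t_star=0.15, n_steps=100, seed=0):
    """Diffuse to t*, then integrate the reverse VP-SDE back to t = 0.

    score_fn(x, t) is a hypothetical callable approximating grad_x log p_t(x).
    """
    rng = np.random.default_rng(seed)
    x = diffuse(x_adv, t_star, rng)
    dt = t_star / n_steps
    for i in range(n_steps):
        t = t_star - i * dt
        # Reverse-time drift: -(f - g^2 * score) with f = -0.5 * beta * x, g^2 = beta.
        drift = 0.5 * beta(t) * x + beta(t) * score_fn(x, t)
        x = x + drift * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(x.shape)
    return x
```

Under this assumed schedule, `alpha(0.15)` is roughly 0.79, so most of the image content survives the forward step while small perturbations are swamped by the injected noise. A larger `t_star` removes more of the perturbation but also more of the image, which is the robustness/accuracy trade-off the summary describes.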

Abstract

Adversarial training is a common strategy for enhancing model robustness against adversarial attacks. However, it is typically tailored to the specific attack types it is trained on, limiting its ability to generalize to unseen threat models. Adversarial purification offers an alternative by leveraging a generative model to remove perturbations before classification. Since the purifier is trained independently of both the classifier and the threat models, it is better equipped to handle previously unseen attack scenarios. Diffusion models have proven highly effective for noise purification, not only in countering pixel-wise adversarial perturbations but also in addressing non-adversarial data shifts. In this study, we broaden the focus beyond pixel-wise robustness to explore the extent to which purification can mitigate both spectral and spatial adversarial attacks. Our findings highlight its effectiveness in handling diverse distortion patterns across low- to high-frequency regions.

Paper Structure

This paper contains 9 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Clean, adversarial, diffused, and purified images. The clean image is the original, uncorrupted image from ImageNet. The adversarial image is generated for ResNet-50 by perturbing all components (magnitude, phase, and pixel values) of the image. The adversarial image is then purified using diffusion purification. The diffused image corresponds to $t^* = 0.15$, and the purified image is obtained at $t = 0$.
  • Figure 2: Adversarial examples generated by perturbing the magnitude, phase, and pixel values across various architectures. The perturbations are the differences between the original and attacked images, magnified by a factor of 20 for visualization. The distortion histograms, obtained by applying the Fourier transform to the perturbations, highlight the impact of each attack on the spectral characteristics of the images. In ResNet-50 (He et al., 2016), the distortion is primarily concentrated in high-frequency regions, while ViT-B (Dosovitskiy et al., 2021) and Swin-B (Liu et al., 2021) exhibit distortions mainly in the mid-to-low frequency ranges.
  • Figure 3: Diffusion-driven purification adds noise to adversarial images by running the forward diffusion process up to a small timestep $t^*$, yielding the diffused images. These are then denoised by the reverse process to recover clean images before classification.
  • Figure 4: Example of purified images for a pixel attack on the ResNet-50 model with $t^* = 0.15$.
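
The magnitude/phase decomposition and the distortion histograms mentioned in the Figure 1 and Figure 2 captions can be illustrated with a short NumPy sketch. The function names, the single-channel input, and the radial binning scheme are assumptions made for illustration; the text does not specify how the histograms were actually computed.

```python
import numpy as np


def split_spectrum(img):
    """Return (magnitude, phase) of the 2-D FFT of a single-channel image."""
    f = np.fft.fft2(img)
    return np.abs(f), np.angle(f)


def merge_spectrum(magnitude, phase):
    """Rebuild an image from magnitude and phase.

    Taking the real part discards any imaginary residue, which appears if the
    magnitude or phase were perturbed without preserving conjugate symmetry.
    """
    return np.real(np.fft.ifft2(magnitude * np.exp(1j * phase)))


def radial_energy(delta, n_bins=8):
    """Fraction of a nonzero perturbation's spectral energy per radial frequency bin.

    Bin 0 holds the lowest frequencies, bin n_bins - 1 the highest.
    """
    h, w = delta.shape
    # fftshift moves the zero-frequency term to index (h // 2, w // 2).
    power = np.abs(np.fft.fftshift(np.fft.fft2(delta))) ** 2
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)
    scaled = radius / (radius.max() + 1e-12) * n_bins
    bins = np.minimum(scaled.astype(int), n_bins - 1)
    energy = np.bincount(bins.ravel(), weights=power.ravel(), minlength=n_bins)
    return energy / energy.sum()
```

Given a clean image `x` and an attacked image `x_adv`, calling `radial_energy(x_adv - x)` gives a coarse profile of where the distortion sits: mass in the last bins corresponds to the high-frequency pattern the caption reports for ResNet-50, and mass in the first bins to the mid-to-low frequency pattern reported for ViT-B and Swin-B.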