AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models

Xuelong Dai, Kaisheng Liang, Bin Xiao

TL;DR

AdvDiff addresses the threat of unrestricted adversarial examples by generating them with pre-trained diffusion models rather than perturbing real data. It introduces two adversarial guidance mechanisms that steer the reverse diffusion sampling toward a targeted misclassification while preserving high-quality generation, and it provides theoretical support for these techniques. Empirical results on MNIST and ImageNet show AdvDiff outperforms GAN-based UAE methods and other diffusion-based attacks in both attack effectiveness and sample quality, including resilience against defenses and better transfer to black-box models. The work highlights diffusion models as a powerful platform for UAE generation and underscores the need for robust defenses against unrestricted adversarial threats.

Abstract

Unrestricted adversarial attacks present a serious threat to deep learning models and adversarial defense techniques. They pose severe security problems for deep learning applications because they can effectively bypass defense mechanisms. However, previous attack methods often directly inject Projected Gradient Descent (PGD) gradients into the sampling of generative models, an approach that is not theoretically provable and thus generates unrealistic examples when adversarial objectives are incorporated, especially for GAN-based methods on large-scale datasets like ImageNet. In this paper, we propose a new method, called AdvDiff, to generate unrestricted adversarial examples with diffusion models. We design two novel adversarial guidance techniques to conduct adversarial sampling in the reverse generation process of diffusion models. These two techniques are effective and stable in generating high-quality, realistic adversarial examples by integrating gradients of the target classifier in an interpretable way. Experimental results on the MNIST and ImageNet datasets demonstrate that AdvDiff is effective in generating unrestricted adversarial examples and outperforms state-of-the-art unrestricted adversarial attack methods in terms of attack performance and generation quality.

Paper Structure

This paper contains 18 sections, 11 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: The two new guidance techniques in our AdvDiff to generate unrestricted adversarial examples. During the reverse generation process, the adversarial guidance is added to the intermediate sample $x_t$, which injects the adversarial objective $y_a$ into the diffusion process. The noise sampling guidance modifies the original noise by increasing the conditional likelihood of $y_a$.
  • Figure 2: Unrestricted adversarial examples generated by the diffusion model. The generated adversarial examples should be visually indistinguishable from clean data with label $y$ but wrongly classified by the target classifier $f$.
  • Figure 3: Adversarial examples on the MNIST dataset. Perturbation-based attack methods generate visible noise patterns to conduct attacks, while the examples from unrestricted adversarial attacks (U-GAN and AdvDiff) are imperceptibly different from clean data.
  • Figure 4: Comparison of unrestricted adversarial attacks using GANs and diffusion models on two datasets. Left: generated samples from U-GAN (BigGAN for the ImageNet dataset). Right: generated samples from AdvDiff. We generate unrestricted adversarial examples on the MNIST "0" label and the ImageNet "mushroom" label. U-GAN is more likely to generate adversarial examples with the target label, i.e., the examples marked in red font. In contrast, AdvDiff tends to generate samples that are "false negatives" for the target classifier by combining features from the target label.
  • Figure 5: Ablation study of the impact of parameters in AdvDiff. The results are generated from the ImageNet dataset against the ResNet50 model. We report the attack success rate (ASR) and Inception Score (IS) to show the impact on attack performance and generation quality, respectively.
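
The two guidance mechanisms described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the paper's implementation: the "diffusion model" is a hypothetical stand-in (`toy_reverse_step`), the target classifier $f$ is a linear softmax model whose gradient is available in closed form, and the scale parameters `s` and `a`, the noise level `sigma`, and the exact form of both updates are illustrative choices rather than the paper's update rules.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax of a 1-D logit vector."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def grad_log_prob(W, b, x, y_a):
    """Gradient of log p_f(y_a | x) w.r.t. x for a linear softmax classifier.

    For logits W @ x + b, d/dx log p_y = W[y] - sum_k p_k W[k].
    """
    p = np.exp(log_softmax(W @ x + b))
    return W[y_a] - p @ W

def toy_reverse_step(x, t, rng, sigma):
    """Hypothetical stand-in for one reverse diffusion step (assumption):
    shrink the sample and add Gaussian noise, except at the last step."""
    noise = rng.standard_normal(x.shape) if t > 1 else 0.0
    return 0.9 * x + sigma * noise

def guided_sampling(W, b, y_a, x_T, steps=20, s=10.0, sigma=0.1, seed=0):
    """Reverse process with adversarial guidance added to each intermediate sample."""
    rng = np.random.default_rng(seed)
    x = x_T.copy()
    for t in range(steps, 0, -1):
        x = toy_reverse_step(x, t, rng, sigma)
        # Adversarial guidance (illustrative form): nudge x_t toward higher
        # classifier likelihood of the adversarial label y_a.
        x = x + s * sigma**2 * grad_log_prob(W, b, x, y_a)
    return x

def noise_sampling_guidance(W, b, y_a, x_T, x_0, a=1.0):
    """Noise sampling guidance (illustrative form): shift the initial noise x_T
    along the gradient of log p_f(y_a | x_0) taken at the generated sample x_0."""
    return x_T + a * grad_log_prob(W, b, x_0, y_a)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    W, b = rng.standard_normal((2, 4)), np.zeros(2)  # toy 2-class classifier
    x_T = rng.standard_normal(4)                     # initial noise
    x_0 = guided_sampling(W, b, y_a=1, x_T=x_T)
    x_T_new = noise_sampling_guidance(W, b, 1, x_T, x_0)
    print("log p(y_a | x_0):", log_softmax(W @ x_0 + b)[1])
    print("shift applied to x_T:", x_T_new - x_T)
```

In a realistic setting the gradient would come from automatic differentiation through a neural classifier, and the reverse step from a pre-trained diffusion sampler; the sketch only shows where each guidance term enters the sampling loop.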