Adversarial Guided Diffusion Models for Adversarial Purification

Guang Lin, Zerui Tao, Jianhai Zhang, Toshihisa Tanaka, Qibin Zhao

TL;DR

Diffusion-model-based adversarial purification can remove perturbations but may distort semantic content. This work proposes the Adversarial Guided Diffusion Model (AGDM), which introduces adversarial guidance learned in latent space to preserve semantics while removing perturbations during reverse diffusion. The guided reverse step is modeled by $p_{\theta,\phi}(x_{t-1}|x_t,y,x') \propto p_\theta(x_{t-1}|x_t)\, p_\phi(x'|x_t)\, p_\phi(y|x_t)$, with $p_\phi(x'|x_t) \propto \exp(-\mathcal{D}(c_\phi(x'), c_\phi(x_t)))$ and $p_\phi(y|x_t)=\text{softmax}(c_\phi(x_t))$, where $x'$ denotes the adversarial example and $\mathcal{D}$ is a distance between outputs of the auxiliary network $c_\phi$. The auxiliary network $c_\phi$ is trained via a TRADES-like objective $\min_{\phi} \mathbb{E}_{p_{data}} [ L(c_\phi(x), \overline{y}) + \lambda \max_{\|\delta\| \le \varepsilon} \mathcal{D}(c_\phi(x), c_\phi(x')) ]$, with $x' = x + \delta$, to balance accuracy and robustness. Empirical results on CIFAR-10/100 and ImageNet show that AGDM improves on prior DM-based AP methods (by up to 7.30% in average robust accuracy on CIFAR-10) and generalizes well to unseen attacks, demonstrating the practical value of latent-space adversarial guidance for robust purification.
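To make the two guidance factors concrete, the following is a minimal sketch of the guidance log-density $\log p_\phi(x'|x_t) + \log p_\phi(y|x_t)$ (up to a constant) and its gradient with respect to $x_t$. It rests on assumptions that are not taken from the paper: $c_\phi$ is replaced by a toy linear map `W x + b`, $\mathcal{D}$ is taken to be the squared Euclidean distance between logits, and the names `logits`, `guidance_log_prob`, and `guidance_grad` are hypothetical. How AGDM actually injects this gradient into the reverse step is not specified in this summary.

```python
import math

def logits(W, b, x):
    """Toy linear stand-in for the auxiliary network c_phi: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def log_softmax(z):
    """Numerically stable log-softmax over a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def guidance_log_prob(W, b, x_t, x_adv, y):
    """-D(c(x'), c(x_t)) + log softmax(c(x_t))[y], with D assumed squared L2."""
    z_t, z_adv = logits(W, b, x_t), logits(W, b, x_adv)
    dist = sum((a - t) ** 2 for a, t in zip(z_adv, z_t))
    return -dist + log_softmax(z_t)[y]

def guidance_grad(W, b, x_t, x_adv, y):
    """Analytic gradient of guidance_log_prob with respect to x_t."""
    z_t, z_adv = logits(W, b, x_t), logits(W, b, x_adv)
    probs = [math.exp(v) for v in log_softmax(z_t)]
    # Gradient w.r.t. the logits: distance term plus (one-hot(y) - softmax).
    grad_z = [2.0 * (a - t) + ((1.0 if k == y else 0.0) - p)
              for k, (a, t, p) in enumerate(zip(z_adv, z_t, probs))]
    # Chain rule through the linear map: W^T grad_z.
    return [sum(W[k][j] * grad_z[k] for k in range(len(W)))
            for j in range(len(x_t))]
```

With a real network, the same gradient would come from automatic differentiation rather than the closed form used here. The point of the sketch is only that the distance is measured between outputs of $c_\phi$, not between pixels of $x'$ and $x_t$.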

Abstract

Diffusion model (DM) based adversarial purification (AP) has proven to be a powerful defense method that can remove adversarial perturbations and generate a purified example without threats. In principle, pre-trained DMs can only ensure that purified examples conform to the same distribution as the training data, but they may inadvertently compromise the semantic information of input examples, leading to misclassification of purified examples. Recent work introduces guided diffusion techniques to preserve semantic information while removing the perturbations. However, this guidance often relies on distance measures between purified examples and diffused examples, which can also preserve perturbations in purified examples. To further unleash the robustness power of DM-based AP, we propose an adversarial guided diffusion model (AGDM) by introducing a novel adversarial guidance that contains sufficient semantic information but does not explicitly involve adversarial perturbations. The guidance is modeled by an auxiliary neural network obtained with adversarial training, which measures distance in the latent representations rather than in pixel-level values. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that our method is effective at simultaneously maintaining semantic information and removing the adversarial perturbations. In addition, comprehensive comparisons show that our method significantly enhances the robustness of existing DM-based AP, improving average robust accuracy by up to 7.30% on CIFAR-10.
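As an illustration of how an auxiliary network might be "obtained with adversarial training", the sketch below evaluates a TRADES-style loss for a single example: a natural cross-entropy term plus $\lambda$ times the worst-case distance between clean and perturbed outputs. Everything specific here is an assumption for illustration, not the authors' setup: the toy linear model standing in for $c_\phi$, the squared Euclidean distance in logit space as $\mathcal{D}$, the $l_\infty$ ball, the sign-gradient inner loop, and the hypothetical names `pgd_delta` and `trades_like_loss`.

```python
import math
import random

def logits(W, b, x):
    """Toy linear stand-in for the auxiliary network c_phi: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def cross_entropy(z, y):
    """Natural loss L(c_phi(x), y) from logits z and integer label y."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[y]

def sq_dist(u, v):
    """Assumed distance D: squared Euclidean distance between logit vectors."""
    return sum((a - c) ** 2 for a, c in zip(u, v))

def pgd_delta(W, b, x, eps, steps=10, seed=0):
    """Approximate the inner max over ||delta||_inf <= eps by sign-gradient ascent."""
    rng = random.Random(seed)
    # Random start, because the gradient of D vanishes at delta = 0.
    delta = [rng.uniform(-eps, eps) for _ in x]
    z = logits(W, b, x)
    for _ in range(steps):
        z_adv = logits(W, b, [xi + di for xi, di in zip(x, delta)])
        diff = [a - c for a, c in zip(z_adv, z)]
        # Gradient of D w.r.t. delta for the linear model: 2 W^T (z_adv - z).
        grad = [2.0 * sum(W[k][j] * diff[k] for k in range(len(W)))
                for j in range(len(x))]
        step = [math.copysign(eps / 4, g) if g else 0.0 for g in grad]
        # Project back onto the l_inf ball of radius eps.
        delta = [max(-eps, min(eps, d + s)) for d, s in zip(delta, step)]
    return delta

def trades_like_loss(W, b, x, y, eps, lam):
    """Return (total, natural, robust, delta) for one example."""
    z = logits(W, b, x)
    delta = pgd_delta(W, b, x, eps)
    z_adv = logits(W, b, [xi + di for xi, di in zip(x, delta)])
    natural = cross_entropy(z, y)
    robust = sq_dist(z, z_adv)
    return natural + lam * robust, natural, robust, delta
```

In practice the outer minimization over $\phi$ would average this loss over mini-batches and update the network by gradient descent; the sketch stops at evaluating the loss so that the role of each term stays visible.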

Paper Structure

This paper contains 20 sections, 25 equations, 6 figures, 8 tables, and 1 algorithm.

Figures (6)

  • Figure 1: The scheme of diffusion-based purification. Clean examples (CEs) or adversarial examples (AEs) are first diffused with Gaussian noise, which is then removed step by step. For a clearer comparison, the number of steps is set to 400. Unlike previous methods, our method generates purified examples without changing their semantic information or ground-truth labels.
  • Figure 2: Overview of the forward process and reverse process. Different colored dots represent the data distributions of various categories. In the presence of attacks, without guidance or with improper guidance, the red star may move to the wrong category, thereby reducing robust accuracy.
  • Figure 3: Comparison of robust accuracy against PGD+EOT and AutoAttack under (a) the $l_\infty$ ($\epsilon=8/255$) threat model and (b) the $l_2$ ($\epsilon=0.5$) threat model on CIFAR-10 with WideResNet-28-10. The line in the middle of each box represents the average robust accuracy over the two attacks. (c) Accuracy-robustness trade-off under the $l_2$ ($\epsilon=0.5$) threat model, discussed in \ref{sec:atop}.
  • Figure 4: Clean examples (Top), adversarial examples (Middle) and purified examples (Bottom) of CIFAR-10.
  • Figure 5: Clean examples (Top), adversarial examples (Middle) and purified examples (Bottom) of ImageNet.
  • ...and 1 more figures