DiffBreak: Is Diffusion-Based Purification Robust?

Andre Kassis, Urs Hengartner, Yaoliang Yu

TL;DR

This work challenges the core claim that diffusion-based purification (DBP) is robust to adversarial examples by showing that gradient-based adaptive attacks can steer the diffusion score model during purification, producing adversarial outputs rather than clean ones. It introduces DiffBreak and DiffGrad to enable reliable, gradient-informed attacks through DBP, and proposes a majority-vote (MV) robustness estimator to counteract stochasticity, along with a low-frequency (LF) attack that exploits global perturbations. The theoretical result that adaptive attacks can manipulate the purification process undermines prior robustness claims and highlights flaws in one-shot evaluation protocols. The findings demonstrate that current DBP defenses are not viable as standalone solutions and motivate developing purification schemes with private or adversary-inaccessible stochastic dynamics, with DiffBreak providing a standardized toolkit for rigorous future evaluations.

Abstract

Diffusion-based purification (DBP) has become a cornerstone defense against adversarial examples (AEs), regarded as robust due to its use of diffusion models (DMs) that project AEs onto the natural data manifold. We refute this core claim, theoretically proving that gradient-based attacks effectively target the DM rather than the classifier, causing DBP's outputs to align with adversarial distributions. This prompts a reassessment of DBP's robustness, which we attribute to two critical factors: inaccurate gradients and improper evaluation protocols that test only a single random purification of the AE. We show that when stochasticity and resubmission risk are accounted for, DBP collapses. To support this, we introduce DiffBreak, the first reliable toolkit for differentiation through DBP, eliminating the gradient mismatches that previously inflated robustness estimates further. We also analyze the current defense scheme used for DBP, in which classification relies on a single purification, and pinpoint its inherent invalidity. We provide a statistically grounded majority-vote (MV) alternative that aggregates predictions across multiple purified copies and yields a partial but meaningful robustness gain. We then propose a novel adaptation of an optimization method originally used against deepfake watermarking, crafting systemic perturbations that defeat DBP even under MV and challenging DBP's viability.
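
The majority-vote idea in the abstract can be illustrated with a minimal sketch. This is not the DiffBreak implementation: the function `majority_vote_predict`, the toy `noisy_purify` and `threshold_classify` stand-ins, and the `num_copies` parameter are all hypothetical names introduced here, and the "purifier" is just additive Gaussian noise on a scalar rather than a diffusion model. The sketch only shows the aggregation step: purify the same input several times with independent randomness, classify each copy, and return the most frequent label.

```python
import random
from collections import Counter


def majority_vote_predict(x, purify, classify, num_copies=9, rng=None):
    """Classify `x` by majority vote over `num_copies` stochastic purifications.

    `purify(x, rng)` and `classify(x)` are caller-supplied callables
    (hypothetical stand-ins for a DBP pipeline and a classifier).
    Returns the winning label and the fraction of votes it received.
    """
    rng = rng or random.Random(0)
    votes = Counter(classify(purify(x, rng)) for _ in range(num_copies))
    label, count = votes.most_common(1)[0]
    return label, count / num_copies


# Toy stand-ins (assumptions for illustration only).
def noisy_purify(x, rng, sigma=1.0):
    """Pretend purification: perturb a scalar input with Gaussian noise."""
    return x + rng.gauss(0.0, sigma)


def threshold_classify(x):
    """Pretend classifier: label 1 if the input is positive, else 0."""
    return 1 if x > 0 else 0


if __name__ == "__main__":
    for sample in (5.0, 0.3, -5.0):
        label, share = majority_vote_predict(
            sample, noisy_purify, threshold_classify, rng=random.Random(0)
        )
        print(f"x={sample:+.1f} -> label={label}, vote share={share:.2f}")
```

For inputs far from the toy decision boundary, every purified copy receives the same label, so the vote is unanimous. For inputs near the boundary, individual purifications disagree, which is the situation in which a single-purification evaluation is unreliable and aggregation changes the outcome. An odd `num_copies` avoids ties in the two-class case.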

Paper Structure

This paper contains 52 sections, 1 theorem, 29 equations, 16 figures, 6 tables, 3 algorithms.

Key Result

Theorem 2.1

The adaptive attack optimizes the entire reverse diffusion process, modifying the parameters $\{\theta^t_{{\bm{x}}}\}_{t\leq t^*}$ such that the output distribution $\hat{{\bm{x}}}(0) \sim \textit{DBP}^{\{\theta^t_{{\bm{x}}}\}}({\bm{x}})$, where $\textit{DBP}^{\{\theta^t_{{\bm{x}}}\}}({\bm{x}})$ is Since $\{\theta^t_{{\bm{x}}}\}_{t\leq t^*}$ depend on the purification trajectory $\hat{{\bm{x}}}_{

Figures (16)

  • Figure 1: Effects of the identified issues in DBP's backpropagation. Each subfigure visualizes a specific error source.
  • Figure 2: Successful attacks generated by LF and AA-$\ell_{\infty}$. Left: original image. Middle: AA. Right: LF.
  • Figure 4: Successful attacks generated by LF and AA-$\ell_{\infty}$. Left: original image. Middle: AA. Right: LF.
  • Figure 6: Successful attacks generated by LF and AA-$\ell_{\infty}$. Left: original image. Middle: AA. Right: LF.
  • Figure 8: Successful attacks generated with LF. Left: original image. Right: LF.
  • ...and 11 more figures

Theorems & Definitions (3)

  • Theorem 2.1
  • proof
  • proof