Adversarial Examples are Misaligned in Diffusion Model Manifolds

Peter Lorenz, Ricard Durall, Janis Keuper

TL;DR

This study systematically examines the alignment of the distributions of adversarial examples when subjected to the process of transformation using diffusion models, providing compelling evidence that adversarial instances do not align with the learned manifold of the DMs.

Abstract

In recent years, diffusion models (DMs) have drawn significant attention for their success in approximating data distributions, yielding state-of-the-art generative results. Nevertheless, the versatility of these models extends beyond their generative capabilities to encompass various vision applications, such as image inpainting, segmentation, adversarial robustness, among others. This study is dedicated to the investigation of adversarial attacks through the lens of diffusion models. However, our objective does not involve enhancing the adversarial robustness of image classifiers. Instead, our focus lies in utilizing the diffusion model to detect and analyze the anomalies introduced by these attacks on images. To that end, we systematically examine the alignment of the distributions of adversarial examples when subjected to the process of transformation using diffusion models. The efficacy of this approach is assessed across CIFAR-10 and ImageNet datasets, including varying image sizes in the latter. The results demonstrate a notable capacity to discriminate effectively between benign and attacked images, providing compelling evidence that adversarial instances do not align with the learned manifold of the DMs.

Paper Structure (21 sections, 15 equations, 18 figures, 5 tables, 1 algorithm)

Figures (18)

  • Figure 1: Illustration of the difference between an adversarial and a benign sample when subjected to the transformation process. $p_a({\bm{x}})$ represents the distribution of adversarial images, while $p_b({\bm{x}})$ represents the distribution of benign images. Using the inversion and reversion process of a pre-trained DDIM (Song et al., 2020), ${\bm{x}}_a$ and ${\bm{x}}_b$ become ${\bm{x}}_a'$ and ${\bm{x}}_b'$, respectively. These transformed counterparts now belong to distinct distributions, namely $p_a'({\bm{x}})$ and $p_b'({\bm{x}})$, characterized by a significantly reduced overlap. Therefore, this results in a distinct representation of adversarial samples compared to benign samples.
  • Figure 2: Illustration of the pipeline, from data generation through transformation by a pre-trained DDIM to the training of a binary classifier $C$. Adversarial and benign samples are transformed separately. In the transformation, the input image $\mathbf{x}_0$ is first gradually inverted into a noise image $\mathbf{x}_T$ using DDIM inversion (Song et al., 2020), and then denoised step by step until the transformed $\mathbf{x}_0'$ is obtained, as illustrated in \ref{eq:reversed}.
  • Figure 3: Left: Identification of white-box attacks. The classifier has trouble distinguishing PGD, AA, and DF. Benign examples can be clearly distinguished from attacked ones. Center: Identification of black-box attacks. The classifier can clearly distinguish the data transformations of each attack method. Right: Transferability of a binary classifier trained on white-box attacks ($\epsilon=1/255$; without Masked PGD) and tested on all other datasets, as indicated on the x-axis.
  • Figure 4: Transformation of white-box attacked images.
  • Figure 5: Transformation of black-box attacked images.
  • ...and 13 more figures
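
The transformation described in the captions of Figures 1 and 2 (invert $\mathbf{x}_0$ to $\mathbf{x}_T$, then denoise back to $\mathbf{x}_0'$) can be sketched with the deterministic DDIM update. This is a minimal illustration and not the authors' implementation: the noise schedule, the step count, and the function names `make_alpha_bars`, `toy_eps_model`, `ddim_move`, and `transform` are assumptions made for this example, and `toy_eps_model` is a hypothetical stand-in for the pre-trained noise-prediction network.

```python
import numpy as np


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def toy_eps_model(x, t):
    """Hypothetical stand-in for a pre-trained noise predictor eps_theta(x_t, t)."""
    return 0.1 * x


def ddim_move(x, eps, alpha_from, alpha_to):
    """One deterministic DDIM update (eta = 0) between two noise levels.

    The same formula serves inversion (alpha_to < alpha_from) and
    denoising (alpha_to > alpha_from).
    """
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps


def transform(x0, eps_model, alpha_bars):
    """Invert x0 to x_T, then denoise back to x0'. Returns (x_T, x0')."""
    # Index 0 is the clean image, whose cumulative alpha is 1.
    a = np.concatenate([[1.0], alpha_bars])
    num_steps = len(alpha_bars)

    x = np.array(x0, dtype=np.float64)
    for t in range(num_steps):  # inversion: x_t -> x_{t+1}
        x = ddim_move(x, eps_model(x, t), a[t], a[t + 1])
    x_T = x

    for t in range(num_steps, 0, -1):  # denoising: x_t -> x_{t-1}
        x = ddim_move(x, eps_model(x, t), a[t], a[t - 1])
    return x_T, x


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy image in [-1, 1]
alpha_bars = make_alpha_bars(num_steps=50)
x_T, x0_prime = transform(x0, toy_eps_model, alpha_bars)
```

In the pipeline the captions describe, benign and adversarial images are each passed through this transformation separately, and the transformed outputs are then used to train the binary classifier $C$; that training step is omitted here.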