Assessing Robustness via Score-Based Adversarial Image Generation

Marcel Kollovieh; Lukas Gosch; Marten Lienen; Yan Scholten; Leo Schwinn; Stephan Günnemann

TL;DR

ScoreAG leverages score-based diffusion models to generate semantics-preserving adversarial examples beyond $\ell_p$-norm bounds. It supports three use cases: Generative Adversarial Synthesis (GAS), which synthesizes new adversarial images; Generative Adversarial Transformation (GAT), which transforms existing images into adversarial ones; and Generative Adversarial Purification (GAP), which purifies inputs to improve empirical robustness. By combining the diffusion score with task-specific guidance terms, ScoreAG samples from conditioned distributions on the diffusion manifold, producing realistic, semantics-preserving perturbations. Across CIFAR, TinyImageNet, and ImageNet-Compatible datasets, ScoreAG achieves competitive attack strength and, via GAP, high robust accuracy, often surpassing baselines in purification settings. The work highlights the importance of evaluating robustness under semantics-bound adversaries and demonstrates the potential of diffusion-guided adversarial generation to complement traditional $\ell_p$-norm analyses, while acknowledging limitations in universal unrestricted attack evaluation and in the scope of purification in preprocessing settings.
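
Combining a score with a guidance term rests on Bayes' rule: $\nabla_{\mathbf{x}} \log p(\mathbf{x}\mid c) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) + \nabla_{\mathbf{x}} \log p(c\mid\mathbf{x})$. The sketch below checks this identity numerically on a toy one-dimensional Gaussian model. The standard-normal prior, the Gaussian likelihood, and all function names are assumptions made for illustration; they are not ScoreAG's image model or its actual guidance terms.

```python
import numpy as np


def prior_score(x):
    # Score of an assumed standard-normal prior: d/dx log N(x; 0, 1) = -x.
    return -x


def guidance_term(x, c, sigma):
    # d/dx log p(c | x) for an assumed likelihood c ~ N(x, sigma^2).
    return (c - x) / sigma**2


def guided_score(x, c, sigma):
    # Bayes' rule in score form: conditional score = prior score + guidance.
    return prior_score(x) + guidance_term(x, c, sigma)


def posterior_score(x, c, sigma):
    # Closed-form score of the Gaussian posterior p(x | c) for this toy model.
    var = sigma**2 / (1.0 + sigma**2)
    mean = c / (1.0 + sigma**2)
    return -(x - mean) / var


xs = np.linspace(-3.0, 3.0, 7)
# The two ways of computing the conditional score agree on every grid point.
agree = np.allclose(guided_score(xs, 1.5, 0.7), posterior_score(xs, 1.5, 0.7))
```

In this toy case the guidance term has a closed form. In a diffusion setting the same sum would be evaluated at each noise level, with a learned network standing in for the prior score.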

Abstract

Most adversarial attacks and defenses focus on perturbations within small $\ell_p$-norm constraints. However, $\ell_p$ threat models cannot capture all relevant semantics-preserving perturbations, and hence, the scope of robustness evaluations is limited. In this work, we introduce Score-Based Adversarial Generation (ScoreAG), a novel framework that leverages the advancements in score-based generative models to generate unrestricted adversarial examples that overcome the limitations of $\ell_p$-norm constraints. Unlike traditional methods, ScoreAG maintains the core semantics of images while generating adversarial examples, either by transforming existing images or synthesizing new ones entirely from scratch. We further exploit the generative capability of ScoreAG to purify images, empirically enhancing the robustness of classifiers. Our extensive empirical evaluation demonstrates that ScoreAG improves upon the majority of state-of-the-art attacks and defenses across multiple benchmarks. This work highlights the importance of investigating adversarial examples bounded by semantics rather than $\ell_p$-norm constraints. ScoreAG represents an important step towards more encompassing robustness assessments.

Paper Structure (27 sections, 12 equations, 11 figures, 11 tables, 1 algorithm)

Introduction
Background
Score-Based Adversarial Generation
Problem Statement.
Generative Adversarial Synthesis
Generative Adversarial Transformation
Generative Adversarial Purification
Experimental Evaluation
Quantitative Results
Qualitative Analysis
Human Study
Related Work
Discussion
Broader Impact
Experimental Setup and Hyperparameters
...and 12 more sections

Figures (11)

Figure 1: Examples of various adversarial attacks on an image of the class "tiger shark" (a). The inset visualizes a heatmap of the strength of the corresponding perturbation. Although the perturbation generated by ScoreAG-GAT (b) lies outside common $\ell_p$-norm constraints ($\ell_\infty=188/255$, $\ell_2=18.47$), it is aware of the semantics: it removes a small fish to change the predicted label to "hammer shark". This is in stark contrast to APGD (Croce and Hein, 2020) with matching norm constraints, which either (c) produces highly perceptible and unnatural changes or (d) fails to preserve image semantics completely. This is an example of Generative Adversarial Transformation (GAT), one of the three use cases of ScoreAG.
Figure 2: An overview of ScoreAG and its three steps. ScoreAG starts from noise $\mathbf{x}_1$ and iteratively denoises it into an image $\mathbf{x}_0$. It uses the task-specific guidance terms $\nabla_{\mathbf{x}_t} \log p_t(c\mid\mathbf{x}_t)$ and the score function $\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)$ to guide the process towards the task-specific condition $c$. The network $\mathbf{s}_\theta$ is used both to approximate the score function $\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)$ and to compute the one-step Euler prediction $\hat{\mathbf{x}}_0$.
Figure 3: FID (a) and accuracy (b) for increasing $s_\mathbf{y}$ scales in the synthesis (GAS) setup, and robust accuracy (c) for increasing $s_\mathbf{x}$ scales in the purification (GAP) setup under APGD attack. Classifier: WRN-28-10. The shaded area shows the 95% CI over four seeds.
Figure 4: Examples on the CIFAR10 dataset. The synthesis (GAS) panel shows generated images of the classes "horse", "truck", and "deer", which are classified as "automobile", "ship", and "horse", respectively, as $s_\mathbf{y}$ increases. The transformation (GAT) panel shows images of the classes "ship", "horse", and "dog" transformed into adversarial examples classified as "ship", "deer", and "cat". For $s_\mathbf{x}=32$, the images are outside of common perturbation norms, i.e., $\ell_2=0.5$ and $\ell_\infty=8/255$, but preserve image semantics. Examples of selected baselines are shown in Fig. 5.
Figure 5: Examples from the CIFAR10 dataset. The figure presents selected baseline images corresponding to the transformation (GAT) examples in Fig. 4. For ScoreAG-GAT, we used $s_\mathbf{x}=48$. As baselines, we included FAB ($\ell_2=0.5$) and Square ($\ell_\infty=8/255$) to represent restricted attacks, as they achieve the lowest and highest LPIPS scores, respectively. Additionally, we show the two unrestricted baselines, CAA and DiffAttack.
...and 6 more figures
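
The captions of Figures 1 and 4 compare perturbations against $\ell_\infty$ and $\ell_2$ budgets. A minimal sketch of how these two norms can be measured for a pair of images follows. It assumes images are float arrays with pixel values scaled to $[0, 1]$ (consistent with budgets such as $8/255$, but an assumption about the paper's convention), and `perturbation_norms` is a hypothetical helper name, not part of ScoreAG.

```python
import numpy as np


def perturbation_norms(x_clean, x_adv):
    """Return the l_inf and l_2 norms of the perturbation x_adv - x_clean."""
    clean = np.asarray(x_clean, dtype=np.float64)
    adv = np.asarray(x_adv, dtype=np.float64)
    delta = (adv - clean).ravel()  # flatten height, width, and channels
    return {
        "linf": float(np.max(np.abs(delta))),  # largest single-value change
        "l2": float(np.linalg.norm(delta)),    # Euclidean length of the change
    }


# Toy 2x2 RGB image with two modified values (illustrative data only).
x_clean = np.zeros((2, 2, 3))
x_adv = x_clean.copy()
x_adv[0, 0, 0] = 188 / 255
x_adv[1, 1, 2] = 0.5
norms = perturbation_norms(x_clean, x_adv)
```

A perturbation that edits an object, as in Figure 1 (b), changes few pixels by a large amount, so it can exceed a small $\ell_\infty$ budget such as $8/255$ while leaving most of the image untouched.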
