NatADiff: Adversarial Boundary Guidance for Natural Adversarial Diffusion

Max Collins, Jordan Vice, Tim French, Ajmal Mian

TL;DR

NatADiff tackles the mismatch between perturbation-based adversaries and natural test-time errors by focusing on natural adversarial samples that lie on the data manifold. It introduces adversarial boundary guidance within a diffusion sampling framework to embed adversarial structure from the target class while preserving fidelity. The approach combines time-travel sampling, classifier augmentation, and gradient normalization to improve cross-model transferability. On ImageNet, NatADiff achieves comparable attack success rates to state-of-the-art methods but significantly better transferability and closer alignment with natural errors as measured by FID, highlighting its potential for evaluating and strengthening robustness against naturally occurring misclassifications.
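One way to picture the "classifier augmentation" and "gradient normalization" ingredients mentioned above is the toy sketch below. It is not the authors' implementation: the linear softmax classifier, the rescale-plus-noise augmentation, and the names `grad_log_prob` and `augmented_normalized_grad` are all assumptions made for illustration. The sketch averages the gradient of the target-class log-probability over several augmented copies of the input, then rescales the result to unit norm so the guidance step size does not depend on the raw gradient magnitude.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_prob(W, b, x, target):
    """Gradient of log p(target | x) w.r.t. x for a linear softmax classifier."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    return W.T @ (onehot - p)

def augmented_normalized_grad(W, b, x, target, n_aug=8, noise_std=0.1,
                              rng=None, eps=1e-12):
    """Average the guidance gradient over augmented inputs, then normalize.

    The augmentation (random rescale plus Gaussian jitter) is a hypothetical
    stand-in for image augmentations such as crops or flips.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    grads = []
    for _ in range(n_aug):
        scale = rng.uniform(0.9, 1.1)
        noise = noise_std * rng.standard_normal(x.shape)
        x_aug = scale * x + noise
        # Chain rule: d(x_aug)/dx = scale * I, so the gradient is scaled too.
        grads.append(scale * grad_log_prob(W, b, x_aug, target))
    g = np.mean(grads, axis=0)
    return g / (np.linalg.norm(g) + eps)  # unit-norm guidance direction
```

In a diffusion sampler, a direction like this would be added to the denoising update with some guidance weight; averaging over augmentations discourages directions that only fool the classifier on one exact input, which is a plausible reason it would help transferability.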

Abstract

Adversarial samples exploit irregularities in the manifold "learned" by deep learning models to cause misclassifications. The study of these adversarial samples provides insight into the features a model uses to classify inputs, which can be leveraged to improve robustness against future attacks. However, much of the existing literature focuses on constrained adversarial samples, which do not accurately reflect test-time errors encountered in real-world settings. To address this, we propose "NatADiff", an adversarial sampling scheme that leverages denoising diffusion to generate natural adversarial samples. Our approach is based on the observation that natural adversarial samples frequently contain structural elements from the adversarial class. Deep learning models can exploit these structural elements to shortcut the classification process, rather than learning to genuinely distinguish between classes. To leverage this behavior, we guide the diffusion trajectory towards the intersection of the true and adversarial classes, combining time-travel sampling with augmented classifier guidance to enhance attack transferability while preserving image fidelity. Our method achieves comparable attack success rates to current state-of-the-art techniques, while exhibiting significantly higher transferability across model architectures and better alignment with natural test-time errors as measured by FID. These results demonstrate that NatADiff produces adversarial samples that not only transfer more effectively across models, but more faithfully resemble naturally occurring test-time errors.
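The abstract does not spell out how time-travel sampling is implemented. The sketch below shows the general idea as it is commonly described for guided diffusion: after each denoising step, the sample is re-noised back to the previous noise level and denoised again, several times per timestep. Everything concrete here is an assumption for illustration: the 1-D Gaussian toy data, the closed-form noise predictor `eps_hat` standing in for a trained network, the linear beta schedule, and the function names. No adversarial guidance term is included; in a guided sampler it would modify the denoising step.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.05, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

DATA_MEAN, DATA_STD = 2.0, 0.5       # toy 1-D data distribution (assumption)

def eps_hat(x_t, t):
    """Exact noise prediction for Gaussian toy data (stands in for a network)."""
    ab = alpha_bars[t]
    var_t = ab * DATA_STD**2 + (1.0 - ab)
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * DATA_MEAN) / var_t

def denoise_step(x_t, t, rng):
    """One DDPM ancestral step from noise level t to t-1."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_hat(x_t, t)) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def renoise_step(x_prev, t, rng):
    """Forward transition q(x_t | x_{t-1}): travel back one noise level."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(alphas[t]) * x_prev + np.sqrt(betas[t]) * noise

def sample_with_time_travel(n, repeats=3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    for t in range(T - 1, -1, -1):
        for r in range(repeats):
            x_prev = denoise_step(x, t, rng)
            if r < repeats - 1:
                x = renoise_step(x_prev, t, rng)  # go back and try again
        x = x_prev
    return x
```

Calling `sample_with_time_travel(5000)` should return samples whose mean and standard deviation are close to `DATA_MEAN` and `DATA_STD`. The repeated denoise and re-noise cycle gives a guidance signal more chances to act at each noise level while the sample stays close to the model's distribution at that level.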

Paper Structure

This paper contains 29 sections, 3 theorems, 29 equations, 9 figures, 10 tables, 2 algorithms.

Key Result

Theorem H.1

Let $\boldsymbol{x}_t \in \mathbb{R}^m$, $f(t) : \mathbb{R} \rightarrow \mathbb{R}$ and $g(t) : \mathbb{R} \rightarrow \mathbb{R}$ be continuous functions of $t$, and $\mathop{\cdot} d\boldsymbol{B}_t$ denote an Itô integral with respect to the standard multi-dimensional Brownian motion process. The admits a conditional forward distribution of where $\alpha(\tau, t) = \exp \left( \int_{\tau}^t f(
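A minimal numerical sketch of a conditional forward distribution of this kind follows. It assumes the SDE has the linear form $d\boldsymbol{x}_t = f(t)\boldsymbol{x}_t\,dt + g(t)\,d\boldsymbol{B}_t$ that is standard in score-based diffusion, for which $\boldsymbol{x}_t$ given $\boldsymbol{x}_\tau$ is Gaussian with mean $\alpha(\tau, t)\boldsymbol{x}_\tau$ and per-coordinate variance $\int_\tau^t \alpha(s, t)^2 g(s)^2\,ds$, where $\alpha(\tau, t) = \exp\left(\int_\tau^t f(s)\,ds\right)$. The SDE form, the variance expression, the schedule `beta`, and all function names are assumptions and are not taken from the text above. The code compares these closed forms with an Euler-Maruyama simulation of a single coordinate.

```python
import numpy as np

def beta(t):
    return 0.1 + 19.9 * t            # assumed variance-preserving schedule

def f(t):
    return -0.5 * beta(t)            # drift coefficient

def g(t):
    return np.sqrt(beta(t))          # diffusion coefficient

def _trapezoid(y, x):
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def alpha(tau, t, n=201):
    """alpha(tau, t) = exp(integral of f from tau to t)."""
    s = np.linspace(tau, t, n)
    return float(np.exp(_trapezoid(f(s), s)))

def sigma2(tau, t, n=401):
    """Variance: integral of alpha(s, t)^2 * g(s)^2 over s in [tau, t]."""
    s = np.linspace(tau, t, n)
    a = np.array([alpha(si, t) for si in s])
    return _trapezoid(a**2 * g(s)**2, s)

def simulate(x_tau, tau, t, n_steps=1000, n_paths=20000, seed=0):
    """Euler-Maruyama paths of dx = f(t) x dt + g(t) dB, started at x_tau."""
    rng = np.random.default_rng(seed)
    dt = (t - tau) / n_steps
    x = np.full(n_paths, x_tau, dtype=float)
    for k in range(n_steps):
        s = tau + k * dt
        noise = rng.standard_normal(n_paths)
        x = x + f(s) * x * dt + g(s) * np.sqrt(dt) * noise
    return x
```

For example, `simulate(2.0, 0.2, 0.6)` should produce samples whose empirical mean is close to `alpha(0.2, 0.6) * 2.0` and whose empirical variance is close to `sigma2(0.2, 0.6)`. For this particular variance-preserving choice of `f` and `g`, the variance also equals `1 - alpha(0.2, 0.6)**2`.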

Figures (9)

  • Figure 1: A comparison of different types of adversarial samples. Green and red borders indicate non-adversarial and adversarial samples, respectively. A dotted border denotes images generated using Stable Diffusion 1.5 StableDiffusion_Rombach2022, while a solid border indicates real-world photographs. (a) Constrained adversarial attacks (PGD Madry2019 used here) add perturbations to clean images. (b) Natural adversarial samples are test-time errors that do not contain perturbations. (c) Adversarial classifier guidance Dai2024 produces constrained adversarial samples, as the difference between images generated with and without the guidance is minimal and amounts to a constrained perturbation. (d) Adversarial samples generated with NatADiff diverge from those generated without NatADiff.
  • Figure 2: Effect of adversarial classifier guidance, classifier augmentations, adversarial boundary guidance, and time-travel sampling on samples generated by Stable Diffusion 1.5 StableDiffusion_Rombach2022. Prompt = "tiger", adversarial target = "toilet paper", victim classifier = ResNet-50 He2015. Classification scores are given for ResNet-50, Inception Inception_Szegedy2014, ViT Dosovitskiy2021, and adversarially trained ResNet-50 and Inception models Kurakin2018. Note: "T": "Tiger", "TP": "Toilet Paper", "TC": "Tiger Cat".
  • Figure 3: Adversarial samples generated using NatADiff targeting ResNet-50 He2015, Inception Inception_Szegedy2014, and ViT Dosovitskiy2021 victim models (see column labels). We report the true class, adversarial target, and classification scores of both the victim and adversarially trained ResNet-50 and Inception models Kurakin2018. Superscripts T and U indicate targeted and untargeted (similarity-based) attacks, respectively.
  • Figure 4: Top: Natural adversarial samples compiled by Hendrycks2021 for ImageNet Deng2009 classifiers. The green labels denote the ground-truth classes; the red labels are the classes assigned by a ResNet-50 classifier He2015. Bottom: Heatmap of the ResNet-50 adversarial classifier-guidance Dai2024 gradient with respect to the adversarial classes. Arrows point to structural elements from the adversarial class that affect the ResNet-50 classification.
  • Figure 5: Samples generated by NatADiff using the same random seed while varying $\mu$ from $0.0$ to $0.5$, shown left to right. Green and red labels denote the true and adversarial classes, respectively. Images in (a) and (b) exhibit the dual class phenomenon, where large $\mu$ values cause objects from both the true and adversarial classes to appear. Images in (c) and (d) demonstrate the flipped class phenomenon, where large $\mu$ values cause the sample to fully adopt the adversarial class, suppressing the original class features.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Theorem H.1: Conditional Forward Distribution for Denoising Diffusion
  • proof
  • Lemma H.2: Conditional Forward Alternate Parameterisation
  • proof
  • Theorem H.3: Score-Model Link
  • proof