ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models

Chihan Huang, Hao Tang

TL;DR

ScoreAdv is a training-free framework for generating unrestricted adversarial examples (UAEs) that builds adversarial guidance directly into diffusion model sampling. It combines gradient-based guidance during denoising, ScoreCAM-informed inpainting from a reference image, and lightweight noise optimization to produce natural adversarial examples that transfer across classifiers and recognition systems. The approach achieves state-of-the-art attack success rates and image quality on ImageNet and CelebA/LFW, stays computationally efficient, and remains robust under common defenses. It can be extended to multimodal settings by replacing the scoring function with appropriate encoders.

Abstract

Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality, while maintaining inference efficiency. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.

Paper Structure

This paper contains 34 sections, 26 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Adversarial attack results generated by various attack methods. The second row illustrates the difference between the benign and adversarial images. For detailed examination, please enlarge the image.
  • Figure 2: Overall framework of our ScoreAdv to generate adversarial images. The lower part illustrates the trained diffusion model. In each diffusion step, we employ adversarial guidance by first sampling $\bar{\boldsymbol{x}}_{t-1}$ using the diffusion model. Subsequently, adversarial perturbation is introduced based on the gradient derived from the target label and target model. If the content information from a reference image is required, we utilize ScoreCAM to generate its saliency map. We use the diffusion process to obtain $\boldsymbol{x}_{t-1}^{ref}$, which is then weighted and combined with $\tilde{\boldsymbol{x}}_{t-1}$ to produce the input for the next diffusion step $\boldsymbol{x}_{t-1}$.
  • Figure 3: Illustration of ScoreCAM.
  • Figure 4: Visualization of adversarial images and perturbations generated by different attack methods. The upper row displays the adversarial images, while the lower row illustrates the corresponding perturbations.
  • Figure 5: Ablation analysis of ScoreAdv parameters. We selected ResNet-50 as the target model and employed ASR and FID as the evaluation metrics to assess the attack effectiveness and generated image quality, respectively.
  • ...and 1 more figure
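
The per-step procedure described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that do not come from the paper: the target model is replaced by a toy linear classifier `W` so the gradient has a closed form, the reference image is brought to the current noise level with the standard DDPM forward process, and the "weighted and combined" step is read as a convex combination driven by a saliency map in [0, 1]. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def target_log_prob(x, W, target):
    """log p(target | x) for a toy linear classifier with logits W @ x (assumption)."""
    return np.log(softmax(W @ x)[target])

def adversarial_guidance(x_bar, W, target, scale):
    """Shift the denoised sample x_bar along grad_x log p(target | x)."""
    p = softmax(W @ x_bar)
    grad = W[target] - p @ W  # closed-form gradient for the linear toy model
    return x_bar + scale * grad

def noised_reference(x_ref, alpha_bar, rng):
    """DDPM forward noising of the reference image (assumed reading of the caption)."""
    noise = rng.standard_normal(x_ref.shape)
    return np.sqrt(alpha_bar) * x_ref + np.sqrt(1.0 - alpha_bar) * noise

def blend(x_tilde, x_ref_t, saliency):
    """Convex combination: saliency 1 keeps the reference, 0 keeps x_tilde (assumption)."""
    return saliency * x_ref_t + (1.0 - saliency) * x_tilde

def guided_step(x_bar, x_ref, saliency, W, target, scale, alpha_bar, rng):
    """One guided step: perturb the denoised sample, then blend in reference content."""
    x_tilde = adversarial_guidance(x_bar, W, target, scale)
    x_ref_t = noised_reference(x_ref, alpha_bar, rng)
    return blend(x_tilde, x_ref_t, saliency)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((3, 16))   # 3 classes, 16 "pixels"
    x_bar = rng.standard_normal(16)    # stand-in for the diffusion model's sample
    x_ref = np.ones(16)                # stand-in for the reference image
    saliency = np.linspace(0.0, 1.0, 16)
    x_next = guided_step(x_bar, x_ref, saliency, W, target=2,
                         scale=0.01, alpha_bar=0.9, rng=rng)
    print(x_next.shape)
```

In a real pipeline, `x_bar` would come from the pretrained diffusion sampler, the gradient would be computed by automatic differentiation through the actual target model, and the saliency map would come from ScoreCAM applied to the reference image.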