Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models

Takami Sato, Justin Yue, Nanze Chen, Ningfei Wang, Qi Alfred Chen

TL;DR

Diffusion models enable high-quality text-to-image generation, but they also pose a security risk: the images they generate embed non-robust features that persist even when the prompt removes the robust features humans rely on. The authors define the Natural Denoising Diffusion (NDD) attack and assemble the NDDA dataset to systematically measure its effectiveness against object detectors, image classifiers, and human observers, including a physical-world experiment on a Tesla Model 3. They report an 88% detection rate by object detectors, stealthiness to 93% of human subjects, and 73% of physically printed attacks detected as stop signs by the Tesla. The study and dataset are intended to raise awareness of these risks and to support research toward more robust DNN models.

Abstract

Denoising probabilistic diffusion models have shown breakthrough performance to generate more photo-realistic images or human-level illustrations than the prior models such as GANs. This high image-generation capability has stimulated the creation of many downstream applications in various areas. However, we find that this technology is actually a double-edged sword: We identify a new type of attack, called the Natural Denoising Diffusion (NDD) attack based on the finding that state-of-the-art deep neural network (DNN) models still hold their prediction even if we intentionally remove their robust features, which are essential to the human visual system (HVS), through text prompts. The NDD attack shows a significantly high capability to generate low-cost, model-agnostic, and transferable adversarial attacks by exploiting the natural attack capability in diffusion models. To systematically evaluate the risk of the NDD attack, we perform a large-scale empirical study with our newly created dataset, the Natural Denoising Diffusion Attack (NDDA) dataset. We evaluate the natural attack capability by answering 6 research questions. Through a user study, we find that it can achieve an 88% detection rate while being stealthy to 93% of human subjects; we also find that the non-robust features embedded by diffusion models contribute to the natural attack capability. To confirm the model-agnostic and transferable attack capability, we perform the NDD attack against the Tesla Model 3 and find that 73% of the physically printed attacks can be detected as stop signs. Our hope is that the study and dataset can help our community be aware of the risks in diffusion models and facilitate further research toward robust DNN models.

Paper Structure (21 sections, 13 figures, 12 tables)

This paper contains 21 sections, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Examples of the natural attack capability in diffusion models (row). The images are generated with prompts that intentionally remove essential features to humans while keeping "stop sign" in the prompt. Even without these essential features, object detectors (column) still detect these objects with high scores.
  • Figure 2: Overview of the Natural Denoising Diffusion Attack (NDDA) dataset. We alter or remove the 4 types of robust features partially or entirely. For the stop sign, we alter the text on it considering its importance to be recognized as a stop sign. For each set of robust features, we generate images with 3 diffusion models for 3 object classes.
  • Figure 2: Detection rates of 5 object detectors on the stop sign images in the NDDA dataset generated by the 3 diffusion models. Bold and underline denote highest and lowest scores in each row.
  • Figure 3: Average detection rate of stop sign images over the 5 object detectors for 4 models. The "x" mark means the removed robust features.
  • Figure 4: Averaged normalized Levenshtein distances between the given word and the detected text by OCR. The black bar is the averaged distance of all 5 words; the blue bar is only for the "stop" word.
  • ...and 8 more figures
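
The Figure 4 caption refers to an averaged normalized Levenshtein distance between a target word and the text read back by OCR. The following is a minimal sketch of how such a metric could be computed, not the authors' implementation. The normalization by the longer string's length, the lowercasing, and the function names `levenshtein`, `normalized_levenshtein`, and `average_distance` are all assumptions made for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(target: str, detected: str) -> float:
    """Distance scaled to [0, 1]; dividing by the longer length is an assumption."""
    target, detected = target.lower(), detected.lower()
    longest = max(len(target), len(detected))
    if longest == 0:
        return 0.0
    return levenshtein(target, detected) / longest


def average_distance(target: str, ocr_outputs: Sequence[str]) -> float:
    """Mean normalized distance between one target word and many OCR readings."""
    if not ocr_outputs:
        raise ValueError("ocr_outputs must not be empty")
    return sum(normalized_levenshtein(target, t) for t in ocr_outputs) / len(ocr_outputs)
```

Under this convention, a value of 0 means the OCR output matches the target word exactly, and a value of 1 means no characters could be aligned, for example when OCR returns an empty string for a sign whose text was removed.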