Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models

Shawn Shan, Wenxin Ding, Josephine Passananti, Stanley Wu, Haitao Zheng, Ben Y. Zhao

TL;DR

Nightshade reveals a practical vulnerability in diffusion-based text-to-image generation: prompt-specific poisoning can derail responses to targeted prompts with a surprisingly small number of optimized poison samples due to concept sparsity. By aligning poison images to a destination concept and perturbing clean data within a constrained feature space, Nightshade achieves high attack potency, bleed-through to related concepts, and cross-model transferability while evading detection. The work presents extensive evaluations across training-from-scratch and continuous-training scenarios, showing that a modest poison budget can significantly degrade output quality and even destabilize general features when applied broadly. It also discusses defenses and proposes a provocative use-case for IP protection, prompting both technical and policy debates about data licensing and model training safeguards.

Abstract

Data poisoning attacks manipulate training data to introduce unexpected behaviors into machine learning models at training time. For text-to-image generative models with massive training datasets, current understanding of poisoning attacks suggests that a successful attack would require injecting millions of poison samples into their training pipeline. In this paper, we show that poisoning attacks can be successful on generative models. We observe that training data per concept can be quite limited in these models, making them vulnerable to prompt-specific poisoning attacks, which target a model's ability to respond to individual prompts. We introduce Nightshade, an optimized prompt-specific poisoning attack where poison samples look visually identical to benign images with matching text prompts. Nightshade poison samples are also optimized for potency and can corrupt a Stable Diffusion SDXL prompt with fewer than 100 poison samples. Nightshade poison effects "bleed through" to related concepts, and multiple attacks can be composed together in a single prompt. Surprisingly, we show that a moderate number of Nightshade attacks can destabilize general features in a text-to-image generative model, effectively disabling its ability to generate meaningful images. Finally, we propose the use of Nightshade and similar tools as a last defense for content creators against web scrapers that ignore opt-out/do-not-crawl directives, and discuss possible implications for model trainers and content creators.

Paper Structure (32 sections, 2 equations, 25 figures, 9 tables)

This paper contains 32 sections, 2 equations, 25 figures, 9 tables.

Figures (25)

  • Figure 1: Overview of prompt-specific poison attacks against generic text-to-image generative models. (a) User generates poison data (text and image pairs) designed to corrupt a given concept $C$ (e.g., a keyword like "dog"), then posts them online; (b) Model trainer scrapes data from online webpages to train its generative model; (c) Given prompts that contain $C$, poisoned model generates incorrect images.
  • Figure 2: Concept sparsity in LAION-Aesthetic measured by word and semantic frequencies. Note the long-tail distribution and log-scale on both Y axes.
  • Figure 3: Samples of dirty-label poison data in terms of mismatched text/image pairs, curated to attack the concept "dog." Here "cat" was chosen by the attacker as the destination concept $\mathcal{A}$.
  • Figure 4: Example images generated by the clean (unpoisoned) and poisoned SD-XL models with different numbers of poison samples. The attack effect is apparent with 1000 poison samples, but not with 500.
  • Figure 5: An illustrative example of Nightshade's curation of poison data to attack the concept "dog" using "cat". The anchor images (right) are generated by prompting "a photo of cat" on the clean SD-XL model multiple times. The poison images (middle) are perturbed versions of natural images of "dog", which resemble the anchor images in feature representation.
  • ...and 20 more figures
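
The curation idea in the Figure 5 caption, perturbing a natural "dog" image until its feature representation resembles that of a "cat" anchor image, can be illustrated with a toy sketch. Everything here is an assumption for illustration: the linear map `W` stands in for a real image feature extractor, the L-infinity budget `eps` stands in for whatever perceptibility bound the paper uses, and `craft_poison` and `feature_distance` are hypothetical names, not the authors' code.

```python
import numpy as np

def feature_distance(x, y, W):
    """Euclidean distance between two inputs in the toy feature space F(x) = W @ x."""
    return float(np.linalg.norm(W @ x - W @ y))

def craft_poison(x_clean, x_anchor, W, eps=0.05, steps=200):
    """Projected gradient descent on ||W(x_clean + delta) - W x_anchor||^2.

    delta is kept inside an L-infinity ball of radius eps (assumed budget)
    and the perturbed image is kept inside the valid pixel range [0, 1].
    """
    target = W @ x_anchor
    # Per-pixel bounds: intersection of the eps-ball and the [0, 1] pixel range.
    lo = np.maximum(-eps, -x_clean)
    hi = np.minimum(eps, 1.0 - x_clean)
    # Step size 1/L, where L = 2 * sigma_max(W)^2 bounds the gradient's Lipschitz constant.
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    delta = np.zeros_like(x_clean)
    for _ in range(steps):
        residual = W @ (x_clean + delta) - target
        grad = 2.0 * W.T @ residual
        delta = np.clip(delta - lr * grad, lo, hi)  # gradient step, then projection
    return x_clean + delta

# Toy data: random vectors play the role of flattened images.
rng = np.random.default_rng(0)
dim, feat_dim, eps = 64, 16, 0.05
W = rng.normal(size=(feat_dim, dim))        # hypothetical feature extractor
x_dog = rng.uniform(0.0, 1.0, size=dim)     # clean image of the attacked concept
x_cat = rng.uniform(0.0, 1.0, size=dim)     # anchor image of the destination concept

x_poison = craft_poison(x_dog, x_cat, W, eps=eps)
before = feature_distance(x_dog, x_cat, W)
after = feature_distance(x_poison, x_cat, W)
print(f"feature distance to anchor: {before:.3f} -> {after:.3f}")
print(f"max pixel change: {np.max(np.abs(x_poison - x_dog)):.3f}")
```

In this sketch the poison stays within `eps` of the clean image in pixel space while moving toward the anchor in feature space. A real attack would replace `W` with a nonlinear encoder and compute gradients by automatic differentiation.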