PromptLA: Towards Integrity Verification of Black-box Text-to-Image Diffusion Models

Zhuomeng Zhang, Fangqi Li, Chong Di, Hongyu Zhu, Hanyi Wang, Shilin Wang

TL;DR

This work addresses integrity verification for black-box T2I diffusion models: it detects model tampering through shifts in image-feature distributions, measured with the KL divergence $D_{KL}(P\|Q)$. It introduces PromptLA, a learning-automaton-based prompt-selection algorithm that actively chooses which prompts to query so as to maximize discriminability while minimizing query cost, and it uses a relative KL-divergence metric to mitigate the randomness inherent in diffusion sampling. The method achieves a mean AUC above 0.95 across multiple integrity-violation scenarios and base models, is robust to image-level post-processing, and is more efficient than the baselines. The framework provides a practical, scalable standard for integrity verification of AI-generated content, with potential applications in AI copyright litigation and automated model auditing, and it opens avenues for continuous-prompt optimization and broader generative tasks.
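To make the prompt-selection idea concrete, here is a minimal sketch of a learning automaton over a discrete prompt library. It uses the textbook linear reward-inaction update, which is an assumption on our part and may differ from the update rule PromptLA actually uses. The names `select_prompt`, `reward_fn`, `lr`, and the toy prompts and rewards are all hypothetical; in practice the reward would come from a discriminability score such as a normalized KL divergence.

```python
import random


def select_prompt(prompts, reward_fn, rounds=200, lr=0.1, seed=0):
    """Linear reward-inaction automaton over a discrete prompt library.

    reward_fn(prompt) should return a value in [0, 1]; higher means the
    prompt separated the reference and suspect models more clearly.
    This is a generic scheme, not the paper's exact algorithm.
    """
    rng = random.Random(seed)
    probs = [1.0 / len(prompts)] * len(prompts)
    for _ in range(rounds):
        # Sample one prompt according to the current action probabilities.
        i = rng.choices(range(len(prompts)), weights=probs, k=1)[0]
        # Clamp the reward so the update keeps probs a valid distribution.
        beta = min(1.0, max(0.0, reward_fn(prompts[i])))
        # Reward-inaction: move mass toward the chosen prompt in proportion
        # to the reward; a zero reward leaves the probabilities unchanged.
        probs = [
            p + lr * beta * (1.0 - p) if j == i else p * (1.0 - lr * beta)
            for j, p in enumerate(probs)
        ]
    return probs


# Toy usage with made-up rewards (hypothetical values, not measured data).
toy_prompts = ["abstract", "a cat", "a city street"]
toy_rewards = {"abstract": 0.9, "a cat": 0.2, "a city street": 0.4}
toy_probs = select_prompt(toy_prompts, lambda p: toy_rewards[p])
```

Each round costs one query, so the number of rounds directly bounds the interaction cost; the learning rate `lr` trades convergence speed against the risk of locking onto a suboptimal prompt early.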

Abstract

Despite the impressive synthesis quality of text-to-image (T2I) diffusion models, their black-box deployment poses significant regulatory challenges: malicious actors can fine-tune these models to generate illegal content, circumventing existing safeguards through parameter manipulation. Therefore, it is essential to verify the integrity of T2I diffusion models. To this end, considering the randomness in the outputs of generative models and the high cost of interacting with them, we detect model tampering via the KL divergence between the distributions of features extracted from generated images. We propose a novel prompt selection algorithm based on a learning automaton (PromptLA) for efficient and accurate verification. Evaluations on four advanced T2I models (e.g., SDXL, FLUX.1) demonstrate that our method achieves a mean AUC of over 0.96 in integrity detection, exceeding baselines by more than 0.2, showcasing strong effectiveness and generalization. Additionally, our approach incurs lower cost and is robust against image-level post-processing. To the best of our knowledge, this paper is the first work addressing the integrity verification of T2I diffusion models, and it establishes quantifiable standards for AI copyright litigation in practice.
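The distribution comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's estimator: each feature set is modeled as a diagonal Gaussian, and the `relative_kl` normalization (cross-model KL divided by the KL between two independent runs of the reference model) is one plausible way to discount sampling randomness. The abstract does not define the relative metric, so treat that ratio, all function names, and the synthetic features as hypothetical.

```python
import math
import random


def fit_diag_gaussian(features, eps=1e-6):
    """Per-dimension mean and variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    # eps keeps the variance strictly positive for near-constant dimensions.
    var = [
        sum((f[d] - mean[d]) ** 2 for f in features) / n + eps
        for d in range(dim)
    ]
    return mean, var


def kl_diag_gaussian(p, q):
    """KL(P || Q) for two diagonal Gaussians given as (mean, var) pairs."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    return sum(
        0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def relative_kl(reference_a, reference_b, suspect):
    """Assumed normalization: cross-model KL over same-model KL."""
    ref_a = fit_diag_gaussian(reference_a)
    ref_b = fit_diag_gaussian(reference_b)
    sus = fit_diag_gaussian(suspect)
    baseline = kl_diag_gaussian(ref_a, ref_b)
    return kl_diag_gaussian(ref_a, sus) / max(baseline, 1e-12)


def sample(mean_shift, n=50, dim=8, seed=0):
    """Synthetic stand-in for image features (not real model outputs)."""
    rng = random.Random(seed)
    return [[rng.gauss(mean_shift, 1.0) for _ in range(dim)] for _ in range(n)]


# An untampered suspect should score lower than one with shifted features.
same = relative_kl(sample(0.0, seed=1), sample(0.0, seed=2), sample(0.0, seed=3))
shifted = relative_kl(sample(0.0, seed=1), sample(0.0, seed=2), sample(1.0, seed=3))
```

With real image features (for example, high-dimensional embeddings from a pretrained classifier) and only a few dozen images per prompt, a full covariance estimate would be singular, which is why the sketch falls back to a diagonal Gaussian.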

Paper Structure

This paper contains 25 sections, 7 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Integrity violation scenarios and the proposed integrity verification framework for T2I diffusion models.
  • Figure 2: Comparison of images generated by the original model and after various integrity violations. For each prompt, the left column shows images generated with a fixed seed, and the right column shows randomly generated images.
  • Figure 3: t-SNE visualization of features extracted with the Inception-v3 model from images generated for prompt7 ("abstract"). (a) Comparison within the original model SD-v1.5. (b) Comparison between the original model and the model fine-tuned using LoRA.
  • Figure 4: Relative KL divergence between the feature distributions of images generated by T2I diffusion models before and after various integrity violations, for different prompts. The prompt library, generated by GPT-4, contains 50 prompts (24 are shown here). The original model is SD-v1.5. Different colors represent different integrity violations; see the experiment section for details. Each feature distribution is estimated from 50 generated images.
  • Figure 5: A visual instance of prompt selection using PromptLA against integrity violation LoRA1. Base model: SD-v1.5.
  • ...and 1 more figure