One Prompt to Verify Your Models: Black-Box Text-to-Image Models Verification via Non-Transferable Adversarial Attacks

Ji Guo, Wenbo Jiang, Rui Zhang, Guoming Lu, Hongwei Li

TL;DR

This work tackles the practical problem of verifying that black-box text-to-image APIs actually implement the claimed models. It introduces TVN (Text-to-Image Models Verification via Non-Transferable Adversarial Attacks), which generates non-transferable prompts using NSGA-II to force the target model to produce images that differ from those produced by other models, enabling model identification via CLIP-text score thresholds. The method achieves over 90% verification accuracy across diverse T2I models and demonstrates effectiveness on third-party platforms like Hugging Face. By integrating a 3-sigma threshold on CLIP-text scores and a multi-objective perturbation strategy, TVN provides a practical, scalable approach for auditing model claims in open and closed-set settings with real-world applicability.

Abstract

Recently, various Text-to-Image (T2I) models have emerged (such as DALL-E and Stable Diffusion), each showing advantages in different aspects. As a result, some third-party service platforms aggregate different model interfaces and offer cheaper API services and more flexibility in T2I model selection. However, this raises a new security concern: are these third-party services truly offering the models they claim? To answer this question, we first define the concept of T2I model verification, which aims to determine whether a black-box target model is identical to a given white-box reference T2I model. We then propose VerifyPrompt, which performs T2I model verification through a specially designed verify prompt. Intuitively, the verify prompt is an adversarial prompt for the target model that does not transfer to other models: it makes the target model generate a specific image while making other models produce entirely different images. Specifically, VerifyPrompt uses the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to optimize the cosine similarity of a prompt's text encoding, generating verify prompts. Finally, by computing the CLIP-text similarity scores between the prompts and the generated images, VerifyPrompt determines whether the target model matches the reference model. Experimental results demonstrate that VerifyPrompt consistently achieves over 90% accuracy across various T2I models, confirming its effectiveness on practical model platforms (such as Hugging Face).
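The final decision step, combined with the 3-sigma threshold mentioned in the TL;DR, might look roughly like the sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, the scores are made-up numbers rather than results from the paper, and the acceptance rule (target mean at or above the reference mean minus three standard deviations) is one plausible reading of "3-sigma threshold", not necessarily the authors' exact rule.

```python
from statistics import mean, stdev


def three_sigma_threshold(reference_scores):
    """Assumed lower acceptance bound: mean minus three sample standard deviations."""
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)
    return mu - 3.0 * sigma


def is_same_model(reference_scores, target_scores):
    """Accept the platform's claim if the target's mean score clears the bound."""
    return mean(target_scores) >= three_sigma_threshold(reference_scores)


# Hypothetical CLIP-text scores for one verify prompt (illustrative values only).
reference = [0.31, 0.33, 0.32, 0.30, 0.34]  # images from the white-box reference model
matching = [0.32, 0.31, 0.33]               # black-box API that serves the same model
different = [0.12, 0.15, 0.10]              # black-box API that serves another model

print(is_same_model(reference, matching))
print(is_same_model(reference, different))
```

In this toy setup the first call accepts the claim and the second rejects it, because a non-transferable prompt is expected to score much lower on a model it was not optimized for.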

Paper Structure

This paper contains 22 sections, 10 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: The scenario of model verification
  • Figure 2: Example of a non-transferable adversarial prompt
  • Figure 3: The process of the model verification
  • Figure 4: Visualization of the update process of a non-transferable adversarial sample. Standard adversarial attacks tend to update toward the center of the target model's adversarial region, which is likely to overlap with the adversarial regions of other models, leading to transferability. In contrast, the non-transferable objective pushes the sample away from the centers of other models' regions, so that it is effective only for the target model.
  • Figure 5: The workflow of TVN
  • ...and 7 more figures
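
The trade-off described in the Figure 4 caption (effective on the target model, ineffective on others) is the kind of two-objective problem that NSGA-II handles through non-dominated sorting. The sketch below shows only that core dominance test and the extraction of the first Pareto front. It is a simplified illustration, not the paper's algorithm: the two objectives, the candidate names, and the scores are assumptions chosen for the example.

```python
def dominates(a, b):
    """Return True if candidate a dominates candidate b.

    Each candidate is (target_score, other_score). Assumed objectives:
    maximize target_score (effect on the target model) and
    minimize other_score (strongest effect on any other model).
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    ]


# Hypothetical candidate prompts with made-up (target_score, other_score) pairs.
candidates = {
    "p1": (0.90, 0.80),  # strong on the target, but transfers to other models
    "p2": (0.85, 0.20),  # strong on the target, weak elsewhere
    "p3": (0.60, 0.10),  # weaker on the target, barely transfers
    "p4": (0.55, 0.30),  # worse than p2 on both objectives
}

front = pareto_front(candidates)
print(front)
```

Here p4 is dropped because p2 beats it on both objectives, while p1, p2, and p3 each represent a different trade-off. A full NSGA-II loop would additionally rank the remaining fronts, apply crowding-distance selection, and generate new candidates by crossover and mutation.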