One Prompt to Verify Your Models: Black-Box Text-to-Image Models Verification via Non-Transferable Adversarial Attacks
Ji Guo, Wenbo Jiang, Rui Zhang, Guoming Lu, Hongwei Li
TL;DR
This work tackles the practical problem of verifying that black-box text-to-image APIs actually implement the claimed models. It introduces TVN (Text-to-Image Models Verification via Non-Transferable Adversarial Attacks), which generates non-transferable prompts using NSGA-II to force the target model to produce images that differ from those produced by other models, enabling model identification via CLIP-text score thresholds. The method achieves over 90% verification accuracy across diverse T2I models and demonstrates effectiveness on third-party platforms like Hugging Face. By integrating a 3-sigma threshold on CLIP-text scores and a multi-objective perturbation strategy, TVN provides a practical, scalable approach for auditing model claims in open and closed-set settings with real-world applicability.
Abstract
Recently, various Text-to-Image (T2I) models have emerged (such as DALL-E and Stable Diffusion), each showing advantages in different aspects. As a result, some third-party service platforms aggregate different model interfaces and offer cheaper API services and more flexible T2I model selection. However, this raises a new security concern: are these third-party services truly offering the models they claim? To answer this question, we first define the concept of T2I model verification, which aims to determine whether a black-box target model is identical to a given white-box reference T2I model. We then propose VerifyPrompt, which performs T2I model verification through a specially designed verify prompt. Intuitively, the verify prompt is an adversarial prompt for the target model that does not transfer to other models: it makes the target model generate a specific image while making other models produce entirely different images. Specifically, VerifyPrompt uses the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to optimize the cosine similarity of a prompt's text encoding, generating verify prompts. Finally, by computing the CLIP-text similarity scores between the prompts and the generated images, VerifyPrompt can determine whether the target model matches the reference model. Experimental results demonstrate that VerifyPrompt consistently achieves over 90% accuracy across various T2I models, confirming its effectiveness on practical model platforms (such as Hugging Face).
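
To illustrate the final decision step, the following is a minimal sketch of a 3-sigma rule on CLIP-text scores, as mentioned in the TL;DR. It assumes the scores have already been computed elsewhere. The function names (`three_sigma_band`, `matches_reference`), the two-sided acceptance band, and the example scores are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, stdev


def three_sigma_band(reference_scores, k=3.0):
    """Return (low, high) = mean -/+ k * sample std of the reference scores.

    `reference_scores` are CLIP-text scores obtained by running the verify
    prompt through the white-box reference model (assumed precomputed).
    """
    if len(reference_scores) < 2:
        raise ValueError("need at least two reference scores")
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)
    return mu - k * sigma, mu + k * sigma


def matches_reference(target_score, reference_scores, k=3.0):
    """Accept the target model when its score lies inside the band.

    The two-sided band is an assumption of this sketch; a one-sided
    threshold may be more appropriate depending on how the score is defined.
    """
    low, high = three_sigma_band(reference_scores, k)
    return low <= target_score <= high


# Hypothetical scores, for illustration only.
reference = [0.30, 0.31, 0.29, 0.32, 0.28]
print(matches_reference(0.31, reference))  # inside the band
print(matches_reference(0.18, reference))  # far outside the band
```

In this sketch, a target API whose score falls outside the band would be flagged as not matching the claimed reference model.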
