Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security

Ali Naseh, Anshuman Suri, Yuefeng Peng, Harsh Chaudhari, Alina Oprea, Amir Houmansadr

TL;DR

This work reveals a security vulnerability in text-to-image leaderboards: models leave detectable signatures in their generated images that enable rapid deanonymization. Using over 150,000 images from 19 models and 280 prompts, the authors show that simple centroids in CLIP embedding space identify the generating model with approximately 87% top-1 accuracy (and ~95% top-3), even without prompt control. They introduce a prompt-level distinguishability metric and show that some prompts yield perfectly separable model clusters. This permits near-perfect deanonymization and opens the door to further targeted attacks. The results underscore the need for stronger leaderboard defenses, such as monitoring voting patterns or restricting generation data, to prevent rank manipulation and preserve the credibility of benchmark competitions.
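For scale, a uniform random guess over the 19 candidate models places the true model within its top k with probability k/19. The top-1 case, 1/19, is the random-guess baseline drawn in Figure 2. Setting the reported accuracies against that baseline:

$$
p_{\text{rand}}(k) = \frac{k}{19}, \qquad p_{\text{rand}}(1) \approx 5.3\%, \qquad p_{\text{rand}}(3) \approx 15.8\%
$$

$$
\frac{0.87}{p_{\text{rand}}(1)} \approx 16.5, \qquad \frac{0.95}{p_{\text{rand}}(3)} \approx 6.0
$$

The reported top-1 accuracy is therefore roughly 16.5 times chance, and the top-3 accuracy roughly 6 times chance.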

Abstract

Generative AI leaderboards are central to evaluating model capabilities, but they remain vulnerable to manipulation. A key adversarial objective is rank manipulation, which requires the attacker to first deanonymize the models behind the displayed outputs. This threat has previously been demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.
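The following is a minimal sketch of what "simple real-time classification in CLIP embedding space" could look like. It uses a nearest-centroid rule under cosine similarity. It assumes the attacker has already embedded k of their own generations per candidate model for the displayed prompt with some CLIP image encoder. That embedding step is not shown, and toy 2-D vectors stand in for real embeddings. The function names (`rank_models`, `top_k_accuracy`) and the cosine nearest-centroid rule are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm (CLIP embeddings are usually compared by cosine)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def centroid(embeddings: List[Vector]) -> List[float]:
    """Mean of unit-normalized embeddings, re-normalized to unit length."""
    unit = [normalize(e) for e in embeddings]
    return normalize([sum(col) / len(unit) for col in zip(*unit)])


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def rank_models(query: Vector, reference: Dict[str, List[Vector]]) -> List[str]:
    """Order candidate models from most to least likely source of `query`.

    `reference` maps each (hypothetical) model name to the attacker's own
    embeddings of that model's generations for the same prompt.
    """
    cents = {name: centroid(embs) for name, embs in reference.items()}
    return sorted(cents, key=lambda name: cosine(query, cents[name]), reverse=True)


def top_k_accuracy(queries: List[Tuple[Vector, str]],
                   reference: Dict[str, List[Vector]], k: int) -> float:
    """Fraction of (embedding, true_model) pairs whose true model ranks in the top k."""
    hits = sum(label in rank_models(q, reference)[:k] for q, label in queries)
    return hits / len(queries)


if __name__ == "__main__":
    # Toy 2-D stand-ins for CLIP embeddings; model names are hypothetical.
    reference = {
        "model_a": [[1.0, 0.1], [0.9, 0.0]],
        "model_b": [[0.0, 1.0], [0.1, 0.9]],
        "model_c": [[-1.0, 0.0], [-0.9, -0.1]],
    }
    print(rank_models([0.95, 0.05], reference))  # ['model_a', 'model_b', 'model_c']
```

Returning a full ranking rather than a single label is a deliberate choice: one ordering serves top-1 through top-5 evaluation at once.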

Paper Structure

This paper contains 26 sections, 2 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Model-specific generation patterns for a fixed prompt. Each row shows five images from one model, generated with different seeds, illustrating low intra-model diversity and strong inter-model differences.
  • Figure 2: Deanonymization accuracy versus the number of generations $k$ per (prompt, model) pair. Curves show mean Top-1 through Top-5 accuracy over five runs, with one-standard-deviation error bars. The dashed line indicates the random-guess baseline of $1/19$.
  • Figure 3: CLIP-embedding visualizations for two representative prompts with contrasting distinguishability scores. Left: a high-distinguishability prompt (score $=1.0$), where generations from every model form clearly separated clusters. Right: a low-distinguishability prompt (score $=0.21$), where generations from different models overlap substantially, making deanonymization harder.
  • Figure 4: Distribution of the distinguishability score over the evaluation prompts.
  • Figure 5: Relationship between prompt-level distinguishability and deanonymization accuracy. Each point represents one evaluation prompt. The curve shows that higher distinguishability scores are associated with consistently higher top-1 deanonymization accuracy, indicating that the distinguishability metric is a strong predictor of attack success.
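The captions describe the distinguishability score only qualitatively: it is 1.0 when every model forms a clearly separated cluster and low when clusters overlap. This excerpt does not give the score's definition. The sketch below is therefore a hypothetical stand-in, not the authors' metric. It scores a prompt by the fraction of models whose generations are all nearest, by cosine similarity, to their own centroid. It assumes CLIP embeddings are already available. The name `separability_proxy` and the toy vectors are illustrative.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def unit(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def centroid(embeddings: List[Vector]) -> List[float]:
    """Unit-length mean direction of a model's unit-normalized embeddings."""
    vecs = [unit(e) for e in embeddings]
    return unit([sum(col) / len(vecs) for col in zip(*vecs)])


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def separability_proxy(generations: Dict[str, List[Vector]]) -> float:
    """HYPOTHETICAL per-prompt score in [0, 1]; not the paper's definition.

    A model counts as separated if every one of its generations is strictly
    closer (cosine) to its own centroid than to any other model's centroid.
    The score is the fraction of separated models, so 1.0 means all clusters
    are separable by a nearest-centroid rule for this prompt.
    """
    cents = {m: centroid(embs) for m, embs in generations.items()}

    def is_separated(model: str) -> bool:
        for e in generations[model]:
            u = unit(e)
            own = dot(u, cents[model])
            # Any other centroid at least as close means this cluster overlaps.
            if any(dot(u, c) >= own for m, c in cents.items() if m != model):
                return False
        return True

    return sum(is_separated(m) for m in generations) / len(generations)
```

Because each centroid includes the point being tested, this proxy is optimistic. A leave-one-out centroid would be a stricter variant.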