Prompt Performance Prediction for Image Generation

Nicolas Bizzozzero, Ihab Bendidi, Olivier Risser-Maroix

TL;DR

Prompt Performance Prediction (PPP) is the task of predicting how well a prompt will perform for image generation before any image is produced. It is formalized as a regression, $R(T_i)=f_{\theta}(T_i)+\epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma^2)$, where the performance of prompt $T_i$ is the mean score of its $J$ images, $R(T_i)=\frac{1}{J}\sum_{j=1}^J R(I_{i,j})$, and the parameters are learned by maximum likelihood, $\theta^* = \arg\max_{\theta} \sum_i \log p(R_i|T_i;\theta)$. The study evaluates PPP on three prompt-image datasets and three art-domain datasets, using ground-truth relevance proxies from six pre-trained predictors and CLIP-based representations. CLIP-based textual embeddings deliver the strongest PPP signals, outperforming other language models, though a modality gap between image and text representations can limit cross-modal predictions. The results demonstrate PPP's potential to guide prompt design and reformulation in generative IR, enabling proactive optimization and resource-efficient content creation, with implications for user experience and model feedback.
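The formalization above can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings and image scores are synthetic stand-ins, the helper names (`prompt_scores`, `fit_ridge`, `pearson`) are hypothetical, and a linear model is assumed for $f_{\theta}$. Under the Gaussian noise assumption, maximizing the likelihood reduces to least squares; the small ridge penalty is an added assumption for numerical stability.

```python
import numpy as np

def prompt_scores(image_scores, prompt_ids, n_prompts):
    """R(T_i): mean of the image scores R(I_ij) belonging to prompt i."""
    sums = np.bincount(prompt_ids, weights=image_scores, minlength=n_prompts)
    counts = np.bincount(prompt_ids, minlength=n_prompts)
    return sums / counts

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an intercept (assumed linear f_theta)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic data: random vectors stand in for text embeddings of prompts,
# and noisy linear scores stand in for the pre-trained predictors' outputs.
rng = np.random.default_rng(0)
n_prompts, dim, images_per_prompt = 400, 16, 4
X = rng.normal(size=(n_prompts, dim))
true_w = rng.normal(size=dim)
prompt_ids = np.repeat(np.arange(n_prompts), images_per_prompt)
image_scores = X[prompt_ids] @ true_w + rng.normal(size=prompt_ids.size)

# Aggregate image-level scores into one target per prompt.
y = prompt_scores(image_scores, prompt_ids, n_prompts)

# Fit on 300 prompts, then predict held-out prompts from text alone.
train, test = slice(0, 300), slice(300, None)
w, b = fit_ridge(X[train], y[train])
pred = X[test] @ w + b
r = pearson(pred, y[test])
print(f"Pearson r on held-out prompts: {r:.3f}")
```

In a real setting, `X` would hold embeddings from a text encoder and `image_scores` the outputs of an image-quality predictor; the correlation on held-out prompts is the kind of quantity the study reports.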

Abstract

The ability to predict the performance of a query before results are returned has been a longstanding challenge in Information Retrieval (IR) systems. Inspired by this task, we introduce a novel task called "Prompt Performance Prediction" (PPP), which aims to predict the performance of a prompt before obtaining the actual generated images. We demonstrate the plausibility of our task by measuring the correlation coefficient between predicted and actual performance scores across three datasets containing pairs of prompts and generated images and three art-domain datasets of real images and real user appreciation ratings. Our results show promising performance prediction capabilities, suggesting potential applications for optimizing user prompts.

Paper Structure

This paper contains 9 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Principal components from PCA were computed on Clip-ViT-B-32 embeddings of prompts and images (Stable Diffusion). The first component distinctly captures the separation between these two modalities. One prompt can be linked to multiple generated images.
  • Figure 2: Pearson correlation between image relevance metrics on DALL-E 2: SAMPNet (compositionality), ResMem and ViTMem (memorability), and others (aesthetic). All correlations are statistically significant ($p$-value $< 0.01$).
  • Figure 3: A reconstructed prompt generated from a combination of different information types, including the caption, painter and epoch, and valence. The prompt is created based on a general and arbitrary template. When the caption is missing, a caption is generated by the BLIP Image Captioner, which takes the painting image as input. Other information, such as painter and epoch, is extracted from metadata whenever available, or is inferred from the painting image using other models.
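The modality separation described for Figure 1 can be sketched on synthetic data. This is an illustration under stated assumptions, not a reproduction of the figure: the "text" and "image" embeddings are random vectors, the constant `offset` between them is a hypothetical stand-in for the modality gap, and `pca_project` is a hypothetical helper implementing PCA via SVD.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto their top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

rng = np.random.default_rng(0)
dim, n = 32, 200

# Hypothetical modality gap: image embeddings are shifted by a constant
# vector relative to text embeddings (an assumption for illustration).
offset = np.zeros(dim)
offset[0] = 8.0
text = rng.normal(size=(n, dim))
image = rng.normal(size=(n, dim)) + offset

# Joint PCA over both modalities, as one would do for a plot like Figure 1.
Z = pca_project(np.vstack([text, image]))
pc1_text, pc1_image = Z[:n, 0], Z[n:, 0]

# Compare the distance between modality means on the first component
# with the spread within each modality.
gap = abs(pc1_text.mean() - pc1_image.mean())
spread = max(pc1_text.std(), pc1_image.std())
print(f"PC1 gap between modalities: {gap:.2f}, within-modality std: {spread:.2f}")
```

When the shift between modalities dominates the within-modality variance, the first principal component aligns with the shift, which is consistent with the first component separating prompts from images.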