
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)

Michael Saxon, Fatima Jahara, Mahsa Khoshnoodi, Yujie Lu, Aditya Sharma, William Yang Wang

TL;DR

This paper introduces T2IScoreScore, a curated set of semantic error graphs, each containing a prompt and a set of increasingly erroneous images. The benchmark rigorously tests whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests.

Abstract

With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness -- the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented with correlation to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.

Paper Structure

This paper contains 42 sections, 7 equations, 13 figures, and 3 tables.

Figures (13)

  • Figure 1: Overview of T2IScoreScore. T2I evaluation metrics are scored based on their ability to correctly organize images in a semantic error graph (SEG) relative to their generating prompt, checking ordering (Spearman's $\rho$) and separation of nodes (Kolmogorov--Smirnov statistic).
  • Figure 2: The three semantic error graph production procedures. Synth. (images generated from multiple prompts written to populate a SEG), Nat. (natural images populate a SEG), and Real (real errors from image generation attempts from one prompt populate a SEG).
  • Figure 3: Overview of the distribution of sample types in TS2: (a) Where source images came from: 5% of images in the benchmark are real photographs from Pexels, while the remainder were generated by Stable Diffusion (SD) or DALL-E variants. (b) Source of the eliciting prompt: either existing resources or the authors (Manual). (c) Distribution of error-type edges across all SEGs.
  • Figure 4: Plots of ordering and separation scores against estimated per-image metric evaluation cost (in FLOPs) and against each other. Across all analyses, the Pareto-optimal metrics are DSG and TIFA with GPT-4 and the vastly less expensive embedding-correlation metrics ALIGNScore and CLIPScore.
  • Figure 5: Example of a SEG (85) with a more complex structure. Some nodes have multiple child nodes, and some edges correspond to more than one error (dark red).
  • ...and 8 more figures
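
The Figure 1 caption names two statistics: Spearman's $\rho$ for ordering and the Kolmogorov--Smirnov statistic for node separation. As a rough illustration of what those two checks measure, the sketch below computes both on a toy, linear chain of error nodes. The error counts, metric scores, function names, and the choice to compare only adjacent nodes are all assumptions made for illustration; the paper's actual aggregation over semantic error graphs is not described in this section and may differ.

```python
from bisect import bisect_right


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        # Rho is undefined for constant input; returning 0.0 is an assumption.
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy data (hypothetical): objective error count per image, and the score
# some faithfulness metric assigned to that image.
error_counts = [0, 0, 1, 1, 2, 2]
metric_scores = [0.95, 0.90, 0.70, 0.75, 0.40, 0.35]

# Ordering: a faithful metric should score images lower as errors increase,
# so rho between error count and score should be strongly negative.
rho = spearman_rho(error_counts, metric_scores)

# Separation: group scores by error count and compare adjacent nodes.
nodes = {}
for count, score in zip(error_counts, metric_scores):
    nodes.setdefault(count, []).append(score)
counts = sorted(nodes)
separations = [
    ks_statistic(nodes[lo], nodes[hi]) for lo, hi in zip(counts, counts[1:])
]

print("ordering (Spearman rho):", rho)
print("separation (KS per adjacent node pair):", separations)
```

A metric that ranks the images correctly but assigns nearly identical scores to neighbouring nodes would do well on the ordering check and poorly on the separation check, which is why the two are reported side by side.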