Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)

Michael Saxon; Fatima Jahara; Mahsa Khoshnoodi; Yujie Lu; Aditya Sharma; William Yang Wang

TL;DR

This paper introduces T2IScoreScore (TS2), a curated set of semantic error graphs, each pairing a prompt with a set of increasingly erroneous images. TS2 tests whether a given prompt faithfulness metric can correctly order images by their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests.

Abstract

With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness -- the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented with correlation to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.
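The two checks named in the abstract, ordering by error count and separation between error nodes, can be illustrated with a small sketch. It uses Spearman's rank correlation for ordering and the two-sample Kolmogorov-Smirnov statistic for separation, the two statistics mentioned in the paper's Figure 1 caption. It is not the paper's exact meta-metric definition: how scores are aggregated over walks and nodes of a semantic error graph is not reproduced here, and the function names, error counts, and metric scores below are hypothetical.

```python
from bisect import bisect_right
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


# Hypothetical toy data: error counts and metric scores for images of one prompt.
error_counts = [0, 0, 1, 1, 2, 2]
metric_scores = [0.91, 0.88, 0.74, 0.70, 0.52, 0.49]

# A faithful metric should score images lower as errors increase,
# so the scores are correlated with the negated error counts.
ordering = spearman_rho([-e for e in error_counts], metric_scores)

# Separation between the 0-error and 1-error image populations.
separation = ks_statistic(metric_scores[0:2], metric_scores[2:4])

print(f"ordering (Spearman rho): {ordering:.3f}")
print(f"separation (KS statistic): {separation:.3f}")
```

In this toy case the metric orders the images almost perfectly and fully separates the two error populations. A metric that assigned overlapping scores to the 0-error and 1-error images would yield a lower KS statistic even if its overall rank correlation stayed high, which is why the two checks are reported separately.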

Paper Structure (42 sections, 7 equations, 13 figures, 3 tables)

Introduction
Related Work
T2IScoreScore meta-metrics
Ordering score over walks: $\mathtt{rank}_m$
Statistical separation of error populations score: $\mathtt{sep}_m$
Separation of nodes within dynamic range: $\mathtt{delta}_m$
The T2IScoreScore Dataset
Dataset Collection Procedure
Synth.
Nat.
Real.
Dataset structure, size, and validity
Experiments
Embedding-correlation Metrics
Question Generation & Answering (QG/A) Metrics
...and 27 more sections

Figures (13)

Figure 1: Overview of T2IScoreScore. T2I evaluation metrics are scored based on their ability to correctly organize images in a semantic error graph (SEG) relative to their generating prompt, checking ordering (Spearman's $\rho$) and separation of nodes (Kolmogorov--Smirnov statistic).
Figure 2: The three semantic error graph production procedures. Synth. (images generated from multiple prompts written to populate a SEG), Nat. (natural images populate a SEG), and Real (real errors from image generation attempts from one prompt populate a SEG).
Figure 3: Overview of the distribution of sample types in TS2: (a) Where source images came from: 5% of images in the benchmark are real photographs from Pexels, while the remainder were generated by Stable Diffusion (SD) or DALL-E variants. (b) Source of the eliciting prompt: either existing resources or us (Manual). (c) Distribution of error types over edges in all SEGs.
Figure 4: Plots of ordering and separation scores against estimated per-image metric evaluation costs in FLOPs, and against each other. For all analyses, the Pareto optimal metrics are DSG and TIFA with GPT-4, and the vastly less expensive embedding-correlation ALIGNScore and CLIPScore.
Figure 5: Example of a SEG (85) with a more complex structure. Some nodes have multiple child nodes, and some edges correspond to more than one error (dark red).
...and 8 more figures
