GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, Marjan Ghazvininejad

TL;DR

The paper analyzes benchmark drift in GenEval and finds that, as T2I models have improved and the prompts have saturated, its automated scores have drifted away from human judgment, with an absolute error of up to 17.7% for current models. It introduces GenEval 2, which expands coverage of visual primitives and raises compositionality to better challenge current models, and Soft-TIFA, an open evaluation method that scores images at both the atom and prompt level using an open-source VQA model. Soft-TIFA aligns more closely with human judgment than prior metrics and is less sensitive to shifts in the distribution of T2I outputs over time, which makes it less prone to drift. Together, GenEval 2 and Soft-TIFA provide a more durable framework for evaluating evolving T2I capabilities, while underscoring the need for continual benchmark audits to maintain validity across model generations.

Abstract

Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time, resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is better aligned with human judgment and argue is less likely to drift from human alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed, and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.

Paper Structure

This paper contains 34 sections, 3 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: With the distribution shift of Text-to-Image (T2I) models' outputs over time, we reveal that the model-based evaluation of GenEval decreases in human-alignment, masking the fact that the benchmark is now saturated. We introduce GenEval 2, a more robust benchmark that is challenging for state-of-the-art T2I models, alongside an evaluation method, Soft-TIFA, that is less likely to suffer benchmark drift.
  • Figure 2: Net deviation in reported score from human score on GenEval has increased significantly over time. The models on the X-axis are arranged by release date.
  • Figure 3: We present GenEval 2, a T2I benchmark testing basic capabilities and increasing compositionality. Some samples are shown above. A prompt is considered correctly generated if all component atoms are correctly generated. We show some samples of T2I model outputs on the benchmark, as well as prompt-level annotations for all models and atom-level annotations for Gemini 2.5 Flash Image.
  • Figure 4: GenEval 2 enables various analyses of T2I models: (a) while state-of-the-art (SOTA) T2I models perform well at generating objects, and quite well at assigning them attributes, they struggle with counting, spatial relations, and transitive verb relations; (b) SOTA T2I model performance drops sharply as prompts become more complex. Per-model analyses are provided in the paper's appendix.
  • Figure 5: Under all VQA models, Soft-TIFA$_\text{GM}$ achieves higher human alignment on GenEval 2 across T2I models than VQAScore. Further, it is more robust to T2I output distribution shift over time, potentially because breaking the prompt into per-atom questions renders the VQA model more robust to image distribution shift in T2I outputs.
  • ...and 9 more figures
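
The captions above describe two levels of scoring: atom-level judgments, and a prompt-level verdict in which a prompt counts as correct only if all of its component atoms are correct (Figure 3). The following is a minimal sketch of how such aggregation could look, not the authors' implementation. The function names are hypothetical, the per-atom probabilities are assumed to come from a VQA model, and reading the "GM" subscript in Soft-TIFA$_\text{GM}$ as a geometric mean is an assumption not confirmed by the text above.

```python
import math
from typing import Sequence


def prompt_correct(atom_labels: Sequence[bool]) -> bool:
    """Hard rule from the Figure 3 caption: every atom must be correct."""
    if not atom_labels:
        raise ValueError("a prompt needs at least one atom")
    return all(atom_labels)


def soft_arithmetic_mean(atom_probs: Sequence[float]) -> float:
    """Atom-level soft score (hypothetical): average per-atom probability."""
    if not atom_probs:
        raise ValueError("a prompt needs at least one atom")
    return sum(atom_probs) / len(atom_probs)


def soft_geometric_mean(atom_probs: Sequence[float]) -> float:
    """Prompt-level soft score (assumed meaning of 'GM').

    A single low-probability atom pulls the score down sharply,
    mirroring the all-atoms-correct rule in a soft form.
    """
    if not atom_probs:
        raise ValueError("a prompt needs at least one atom")
    if any(p <= 0.0 for p in atom_probs):
        return 0.0  # one impossible atom makes the whole prompt fail
    log_sum = sum(math.log(p) for p in atom_probs)
    return math.exp(log_sum / len(atom_probs))


def absolute_error(reported_score: float, human_score: float) -> float:
    """Gap between an automated benchmark score and the human score."""
    return abs(reported_score - human_score)


# Illustrative, made-up probabilities for a three-atom prompt.
example_probs = [0.9, 0.9, 0.1]
example_am = soft_arithmetic_mean(example_probs)
example_gm = soft_geometric_mean(example_probs)
```

With these made-up inputs the geometric mean falls below the arithmetic mean, because one weak atom dominates the product. That is the behavior one would want from a prompt-level score that approximates the hard all-atoms rule.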