Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, Shin'ichi Satoh

TL;DR

This paper tackles the unreliability and non-reproducibility of human evaluation in text-to-image generation by proposing a standardized, absolute-rating protocol focused on fidelity and alignment, grounded in crowdsourcing with rigorous annotator qualification and transparent reporting. Through pilot studies and large-scale evaluations across MS-COCO, DrawBench, and PartiPrompts, it demonstrates that automatic metrics like FID and CLIPScore often diverge from human judgments and can be unreliable indicators of perceptual quality. The authors provide open-source tooling, a reporting template, and per-image human ratings to enhance verifiability and facilitate cross-study comparability, while outlining guidelines for crowdsourcing practices and acknowledging limitations and future directions. Overall, the work advocates for continuous refinement of evaluation standards to keep pace with advances in generation models and to enable more trustworthy model comparisons.

Abstract

Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that the current automatic measures are incompatible with human perception in evaluating the performance of the text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments reliably and conclusively. Finally, we make several resources publicly available to the community to facilitate easy and fast implementations.

Paper Structure (19 sections, 7 figures, 1 table)

This paper contains 19 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Conventionally, researchers have used different protocols for human evaluation, and setup details are often unclear. We aim to build a standardized human evaluation protocol.
  • Figure 2: Question and labels of two candidate task designs. A uses typical labels for a 5-point Likert scale. B's labels are more precise.
  • Figure 3: Generated or real images and their human ratings of fidelity and alignment. The scores are the mean of three annotators' ratings.
  • Figure 4: Examples of input captions and generated images.
  • Figure 5: Below each caption, a real image and two images generated from that caption are displayed. The sample-level CLIPScore of each image is shown at its bottom right. LAFITE achieves counter-intuitively high scores.
  • ...and 2 more figures
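
The Figure 3 caption states that each displayed score is the mean of three annotators' ratings. A minimal sketch of that aggregation follows, assuming ratings on a 1-5 scale as in the Likert designs of Figure 2. The image identifiers, rating values, and the function name `per_image_scores` are hypothetical and do not come from the paper's released tooling.

```python
from statistics import mean

# Hypothetical data: image id -> three annotators' ratings (1-5) per criterion.
ratings = {
    "img_001": {"fidelity": [4, 5, 4], "alignment": [3, 3, 4]},
    "img_002": {"fidelity": [2, 1, 2], "alignment": [5, 4, 5]},
}


def per_image_scores(ratings):
    """Average the annotators' ratings for each criterion of every image."""
    return {
        image_id: {
            criterion: mean(values) for criterion, values in criteria.items()
        }
        for image_id, criteria in ratings.items()
    }


scores = per_image_scores(ratings)
for image_id, criteria in scores.items():
    print(image_id, {name: round(value, 2) for name, value in criteria.items()})
```

Keeping the raw per-annotator ratings alongside the means, rather than only the averages, is what allows others to recompute or re-aggregate the scores later.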