Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation
Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, Shin'ichi Satoh
TL;DR
This paper tackles the unreliability and non-reproducibility of human evaluation in text-to-image generation by proposing a standardized, absolute-rating protocol focused on fidelity and alignment, grounded in crowdsourcing with rigorous annotator qualification and transparent reporting. Through pilot studies and large-scale evaluations across MS-COCO, DrawBench, and PartiPrompts, it demonstrates that automatic metrics like FID and CLIPScore often diverge from human judgments and can be unreliable indicators of perceptual quality. The authors provide open-source tooling, a reporting template, and per-image human ratings to enhance verifiability and facilitate cross-study comparability, while outlining guidelines for crowdsourcing practices and acknowledging limitations and future directions. Overall, the work advocates for continuous refinement of evaluation standards to keep pace with advances in generation models and to enable more trustworthy model comparisons.
Abstract
Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future work. In our pilot data collection, we experimentally show that current automatic measures are incompatible with human perception when evaluating text-to-image generation results. Furthermore, we provide insights for designing reliable and conclusive human evaluation experiments. Finally, we make several resources publicly available to the community to facilitate easy and fast implementation.
