Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

TL;DR

The paper examines how well current evaluation metrics reflect human judgments in compositional text-to-image generation, using 12 metrics across 8 compositional categories on 2,400 prompt–image samples. It jointly analyzes correlations with human scores, regression contributions, and score distributions, and finds that no single metric consistently captures human preferences: embedding-based metrics (e.g., ImageReward, HPS) and VQA-based metrics (e.g., DA Score, VQA Score) show complementary strengths across tasks, while image-only metrics perform poorly at measuring text–image alignment. The results underscore the limitations of relying on a single metric and argue for combining complementary signals with transparent reporting to support trustworthy evaluation and reward design. These findings offer practical guidance for benchmarking and improving text-to-image systems, particularly on how metrics should be selected and used in both evaluation and generation pipelines.

Abstract

Text-to-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-to-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for the use of metrics as reward models in generation. The project page is available at https://amirkasaei.com/eval-the-evals/.
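
To make the idea of per-category agreement with human judgment concrete, the following is a minimal sketch and not the authors' code. It assumes Spearman rank correlation as the agreement measure (the text above does not say which correlation is used), and the category names, the metric names `metric_a` and `metric_b`, and all scores are made-up illustrative values rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def per_category_correlation(samples, metric_name):
    """Correlate one metric with human scores separately per category."""
    grouped = defaultdict(lambda: ([], []))
    for s in samples:
        metric_vals, human_vals = grouped[s["category"]]
        metric_vals.append(s[metric_name])
        human_vals.append(s["human"])
    return {cat: spearman(m, h) for cat, (m, h) in grouped.items()}


# Hypothetical toy data: two categories, two metrics, three samples each.
samples = [
    {"category": "color", "human": 1, "metric_a": 0.2, "metric_b": 0.9},
    {"category": "color", "human": 2, "metric_a": 0.5, "metric_b": 0.4},
    {"category": "color", "human": 3, "metric_a": 0.7, "metric_b": 0.6},
    {"category": "spatial", "human": 1, "metric_a": 0.8, "metric_b": 0.1},
    {"category": "spatial", "human": 2, "metric_a": 0.3, "metric_b": 0.5},
    {"category": "spatial", "human": 3, "metric_a": 0.6, "metric_b": 0.9},
]

corr_a = per_category_correlation(samples, "metric_a")
corr_b = per_category_correlation(samples, "metric_b")
```

In this toy data `metric_a` tracks the human ranking in one category and `metric_b` in the other, which mirrors the kind of category-dependent behavior the abstract describes; a single correlation pooled over all samples would hide that split.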

Paper Structure

This paper contains 16 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Value distributions of all analyzed metrics over T2I-CompBench++ generations (bin counts normalized by frequency).