T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation

Zi-Ao Ma, Tian Lan, Rong-Cheng Tu, Shu-Hang Liu, Heyan Huang, Zhijing Wu, Chen Xu, Xian-Ling Mao

TL;DR

This work addresses the need for interpretable automatic evaluation of text-to-image generation by proposing T2I-Eval-R1, a reinforcement learning framework that trains open-source multimodal LLMs using only coarse-grained quality scores. By combining Group Relative Policy Optimization (GRPO) with a continuous reward scheme, the approach yields scalar evaluations together with coherent chain-of-thought rationales, without expensive fine-grained annotations. It builds two training corpora (single-wise and pairwise) and evaluates on three benchmarks (T2I-Eval, TIFA v1.0, ImageReward), achieving state-of-the-art correlations with human judgments and better interpretability than baselines, including GPT-4o-based methods. The results show strong in-domain performance, solid generalization to unseen criteria, and robust rationale quality, pointing to a scalable, transparent path for evaluating diffusion-based T2I systems. Remaining gaps are noted in appearance-quality fidelity and in coverage of further evaluation dimensions.

Abstract

The rapid progress in diffusion-based text-to-image (T2I) generation has created an urgent need for interpretable automatic evaluation methods that can assess the quality of generated images, thereby reducing the human annotation burden. To reduce the prohibitive cost of relying on commercial models for large-scale evaluation, and to improve the reasoning capabilities of open-source models, recent research has explored supervised fine-tuning (SFT) of multimodal large language models (MLLMs) as dedicated T2I evaluators. However, SFT approaches typically rely on high-quality critique datasets, which are either generated by proprietary LLMs (with potential issues of bias and inconsistency) or annotated by humans at high cost, limiting their scalability and generalization. To address these limitations, we propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores, thereby avoiding the need to annotate high-quality interpretable evaluation rationales. Our approach integrates Group Relative Policy Optimization (GRPO) into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains using only easily accessible annotated judgment scores or preferences. Furthermore, we introduce a continuous reward formulation that encourages score diversity and provides stable optimization signals, leading to more robust and discriminative evaluation behavior. Experimental results on three established T2I meta-evaluation benchmarks demonstrate that T2I-Eval-R1 achieves significantly higher alignment with human assessments and offers more accurate interpretable score rationales than strong baseline methods.
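
To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch of how a continuous reward and GRPO-style group-relative advantages could fit together. It is not the paper's implementation: the linear reward shape, the 0 to 10 score range, and the function names `continuous_reward` and `group_relative_advantages` are all assumptions introduced here for illustration. Only the general idea of normalizing each sampled response's reward against its own group's mean and standard deviation follows the standard description of GRPO.

```python
from statistics import mean, pstdev


def continuous_reward(predicted, reference, lo=0.0, hi=10.0):
    """Hypothetical continuous reward (an assumption, not the paper's formula).

    Returns 1.0 when the predicted score equals the human reference score and
    decreases linearly with the absolute error, reaching 0.0 at the full range.
    """
    clipped = min(max(predicted, lo), hi)  # keep out-of-range predictions bounded
    return 1.0 - abs(clipped - reference) / (hi - lo)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Uses the population standard deviation; eps avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative values only: four scores sampled for one (prompt, image) pair
# and a single coarse human score. These numbers are made up for the sketch.
sampled_scores = [3.0, 6.0, 7.5, 9.0]
human_score = 7.0

rewards = [continuous_reward(s, human_score) for s in sampled_scores]
advantages = group_relative_advantages(rewards)

for score, reward, adv in zip(sampled_scores, rewards, advantages):
    print(f"score={score:4.1f}  reward={reward:.3f}  advantage={adv:+.3f}")
```

Under these assumptions, a graded reward separates a near miss from a large error, whereas a binary exact-match reward would assign both the same value. That is one plausible reading of why a continuous formulation gives a more stable optimization signal.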

Paper Structure

This paper contains 43 sections, 8 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Comparison of our method with representative baselines in the results of GPT-4o-based meta-evaluation for interpretable evaluation.
  • Figure 2: Prompt template for T2I-Eval-R1 evaluators. The evaluation prompt is dynamically assembled from this template according to the requirements of the specific task. The adjustable parts for the single-wise and pairwise protocols are placed on the left and right sides, respectively.
  • Figure 3: Prompt template for Appearance Quality evaluation from T2I-Eval dataset.
  • Figure 4: Prompt template for Intrinsic Attribute Consistency evaluation from T2I-Eval dataset.
  • Figure 5: Prompt template for Relationship Attribute Consistency evaluation from T2I-Eval dataset.
  • ...and 9 more figures