T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation

Kaiyue Sun, Rongyao Fang, Chengqi Duan, Xian Liu, Xihui Liu

TL;DR

The paper introduces T2I-ReasonBench, a reasoning-focused benchmark for text-to-image generation along four dimensions (Idioms, Textual Image Design, Entity Reasoning, and Scientific Reasoning), paired with a two-stage evaluation framework using LLMs and MLLMs to assess reasoning accuracy and image quality. It situates the benchmark within the current T2I landscape, arguing that existing datasets emphasize literal prompt-image alignment and fail to test the deeper reasoning and knowledge integration required for complex scenes. Through a comprehensive evaluation of 14 state-of-the-art models (diffusion, autoregressive, and proprietary), the study reveals notable gaps between open-source models and proprietary systems, and shows that prompting strategies involving external reasoning can substantially improve performance. The work highlights the potential for combining explicit reasoning modules with generation and suggests future directions involving knowledge bases and broader reasoning tasks, while underscoring ethical considerations around misuse of image synthesis.

Abstract

We propose T2I-ReasonBench, a benchmark evaluating the reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess reasoning accuracy and image quality. We benchmark various T2I generation models and provide a comprehensive analysis of their performance.

Paper Structure

This paper contains 16 sections, 5 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Overview of T2I-ReasonBench. We propose T2I-ReasonBench, a benchmark evaluating the reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess reasoning accuracy and image quality. We benchmark various T2I generation models and provide a comprehensive analysis of their performance.
  • Figure 2: Left: Prompt collection process. Middle: Subcategories in the four evaluation dimensions. Right: Prompt Suite Statistics.
  • Figure 3: Word cloud to visualize the word distribution of each dimension in our prompt suite.
  • Figure 4: Evaluation Framework of T2I-ReasonBench. We adopt a two-stage evaluation framework: an LLM first generates prompt-specific question-criterion pairs, and an MLLM then analyzes the image and scores it against those pairs. This figure shows one evaluation example for each dimension.
  • Figure 5: Qualitative examples.
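
The two-stage evaluation described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `QuestionCriterion`, `evaluate`, `generate_pairs`, and `score_image` are hypothetical, the LLM and MLLM are passed in as plain callables, and both the `aspect` tag and the per-aspect averaging of scores in [0, 1] are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QuestionCriterion:
    """One prompt-specific evaluation item (hypothetical structure)."""
    question: str
    criterion: str
    aspect: str  # assumed tag: "reasoning" or "quality"


# Stage 1 (assumed interface): an LLM maps a prompt to question-criterion pairs.
GeneratePairs = Callable[[str], List[QuestionCriterion]]
# Stage 2 (assumed interface): an MLLM scores one image against one pair, in [0, 1].
ScoreImage = Callable[[str, QuestionCriterion], float]


def evaluate(prompt: str, image_path: str,
             generate_pairs: GeneratePairs,
             score_image: ScoreImage) -> Dict[str, float]:
    """Run both stages and return the mean score per aspect."""
    pairs = generate_pairs(prompt)  # stage 1: question generation
    if not pairs:
        raise ValueError("stage 1 produced no question-criterion pairs")

    scores_by_aspect: Dict[str, List[float]] = defaultdict(list)
    for pair in pairs:  # stage 2: image analysis and scoring
        scores_by_aspect[pair.aspect].append(score_image(image_path, pair))

    # Averaging is an assumption of this sketch, not a rule from the paper.
    return {aspect: sum(s) / len(s) for aspect, s in scores_by_aspect.items()}


# Stand-in callables so the sketch runs without any model access.
def fake_generate_pairs(prompt: str) -> List[QuestionCriterion]:
    return [
        QuestionCriterion("Is the idiom's figurative meaning shown?",
                          "The implied meaning is depicted.", "reasoning"),
        QuestionCriterion("Is the image free of obvious artifacts?",
                          "No visible distortions.", "quality"),
    ]


def fake_score_image(image_path: str, pair: QuestionCriterion) -> float:
    return 1.0 if pair.aspect == "reasoning" else 0.5


result = evaluate("An image of someone who let the cat out of the bag",
                  "example.png", fake_generate_pairs, fake_score_image)
```

In practice the two stand-in functions would be replaced by calls to an actual LLM and MLLM; keeping them as injected callables keeps the scoring logic independent of any particular model API.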