Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Xiaojuan Qi, Fuli Feng

TL;DR

This work introduces T2I-CoReBench, a benchmark that jointly evaluates composition and reasoning in text-to-image generation across 12 dimensions, with prompts that feature high compositional density and complex inference. It pairs each prompt with a checklist of yes/no questions that are scored automatically by an MLLM, for a total of 1,080 prompts and roughly 13,500 checklist questions, and it evaluates 28 T2I models. The results show steady gains in composition but a substantial bottleneck in reasoning. Prompt rewriting helps weaker models yet does not close the reasoning gap, which points to the need for integrating multimodal reasoning and for encoder-oriented improvements. The framework and results offer a path toward T2I systems that can both faithfully set the stage and direct the play.

Abstract

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred. These two aspects correspond to two core capabilities: composition and reasoning. Despite recent advances of T2I models in both composition and reasoning, existing benchmarks remain limited in evaluation. They not only fail to provide comprehensive coverage across and within both capabilities, but also largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, motivated by the inherent complexity of real-world scenarios, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To facilitate fine-grained and reliable evaluation, we also pair each evaluation prompt with a checklist of individual yes/no questions that assess each intended element independently. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 28 current T2I models reveal that their composition capability remains limited in high-density compositional scenarios, while their reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.
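
The checklist-based scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the names ChecklistItem, image_score, and benchmark_score are hypothetical, the yes/no answers are assumed to have already been produced by an MLLM judge, and averaging per-image ratios into a 0-100 score is an assumed aggregation rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    """One yes/no question about a generated image (hypothetical structure)."""
    question: str
    passed: bool  # assumed to be the judge's yes/no answer for this question


def image_score(items: List[ChecklistItem]) -> float:
    """Ratio of checklist questions answered 'yes' for a single image."""
    if not items:
        raise ValueError("a checklist must contain at least one question")
    return sum(item.passed for item in items) / len(items)


def benchmark_score(per_image: List[List[ChecklistItem]]) -> float:
    """Mean per-image ratio, scaled to 0-100 (assumed aggregation rule)."""
    if not per_image:
        raise ValueError("at least one image is required")
    scores = [image_score(items) for items in per_image]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical checklist for one generated image.
    checklist = [
        ChecklistItem("Is there a red umbrella?", True),
        ChecklistItem("Is the umbrella left of the bench?", True),
        ChecklistItem("Is the bench wooden?", False),
        ChecklistItem("Is it raining?", True),
    ]
    print(image_score(checklist))
```

Scoring each question independently means a single image can earn partial credit, which is what lets the benchmark separate models that get most elements right from models that miss only the implied ones.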

Paper Structure

This paper contains 26 sections, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Overview of our T2I-CoReBench. (a) Our benchmark comprehensively covers two fundamental T2I capabilities (i.e., composition and reasoning), further refined into 12 dimensions. (b-e) Our benchmark poses greater challenges to advanced T2I models, with higher compositional density than DPG-Bench (Hu et al., 2024) and greater reasoning intensity than R2I-Bench (Chen et al., 2025), enabling clearer performance differentiation across models under real-world complexities. Each image is scored based on the ratio of correctly generated elements.
  • Figure 2: Overview of our T2I-CoReBench pipeline.
  • Figure 3: Examples from T2I-CoReBench illustrating (a) composition and (b-d) reasoning capabilities across 12 dimensions (see the appendix on quantitative examples and comparisons for complete versions). Each dimension is designed to incorporate complexity tailored to its unique characteristics, allowing more challenging evaluation under real-world scenarios, and supports fine-grained evaluation with human-verified checklists.
  • Figure 4: Statistics of our T2I-CoReBench showing (a) prompt-token lengths and (b) checklist-question counts. Our benchmark exhibits high complexity in both composition and reasoning capabilities, with an average prompt length of 170 tokens and an average of 12.5 questions per sample.
  • Figure 5: Qualitative examples before and after prompt rewriting. In some reasoning dimensions (e.g., LR), the primary challenge lies in textual reasoning, and prompt rewriting is highly effective. However, tasks such as transforming wheels into squares in HR remain difficult even after prompt rewriting, indicating that textual reasoning alone is insufficient and other mechanisms are required.
  • ...and 7 more figures