Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Ouxiang Li; Yuan Wang; Xinting Hu; Huijuan Huang; Rui Chen; Jiarong Ou; Xin Tao; Pengfei Wan; Xiaojuan Qi; Fuli Feng

TL;DR

This work introduces T2I-CoReBench, a benchmark that jointly evaluates composition and reasoning in text-to-image (T2I) generation across 12 dimensions, using prompts with high compositional density and complex inference. It pairs each prompt with a checklist that is scored automatically by a multimodal LLM (MLLM), for a total of 1,080 prompts and roughly 13,500 checklist questions, and it evaluates 28 T2I models. The results show steady gains in composition but a substantial bottleneck in reasoning. Prompt rewriting helps weaker models but does not close the reasoning gap, which points to the need for integrating multimodal reasoning and for encoder-oriented improvements. The framework and results offer a path toward T2I systems that can both faithfully set the stage and direct the play.

Abstract

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, corresponding to two core capabilities: composition and reasoning. Despite recent advances of T2I models in both capabilities, existing benchmarks remain limited. They not only fail to provide comprehensive coverage across and within both capabilities, but also largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, motivated by the complexity of real-world scenarios, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To facilitate fine-grained and reliable evaluation, we pair each prompt with a checklist of individual yes/no questions that assess each intended element independently. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 28 current T2I models reveal that their composition capability remains limited in scenarios with high compositional density, while their reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.
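The checklist-based evaluation described above can be illustrated with a minimal sketch. It assumes that each generated image receives one yes/no verdict per checklist question, that a prompt's score is the fraction of "yes" verdicts, and that the benchmark score is the unweighted mean over prompts. The names `ChecklistItem`, `prompt_score`, and `benchmark_score` are hypothetical, the aggregation rule is an assumption rather than the paper's stated metric, and the example questions are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ChecklistItem:
    question: str  # a yes/no question about one intended element
    answer: bool   # hypothetical evaluator verdict: True means "yes"


def prompt_score(items: list[ChecklistItem]) -> float:
    """Fraction of checklist questions answered "yes" for one image."""
    if not items:
        raise ValueError("checklist must contain at least one question")
    return sum(item.answer for item in items) / len(items)


def benchmark_score(checklists: list[list[ChecklistItem]]) -> float:
    """Unweighted mean of per-prompt scores (an assumed aggregation)."""
    return mean(prompt_score(items) for items in checklists)


# Illustrative data only; these questions are not taken from the benchmark.
example = [
    [
        ChecklistItem("Is there a red umbrella?", True),
        ChecklistItem("Is the umbrella to the left of the bench?", False),
    ],
    [ChecklistItem("Is the ice in the glass partly melted?", True)],
]
print(benchmark_score(example))
```

Scoring each question independently means a single missed element lowers the prompt's score without zeroing it, which is what makes the evaluation fine-grained under these assumptions.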
