DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

Qirui Jiao; Daoyuan Chen; Yilun Huang; Xika Lin; Ying Shen; Yaliang Li

TL;DR

DetailMaster is the first comprehensive benchmark specifically designed to evaluate T2I models' systematic abilities to handle extended textual inputs that contain complex compositional requirements. Its results reveal fundamental limitations in compositional reasoning.

Abstract

While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance degrades significantly when confronted with the long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models' systematic abilities to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. The benchmark comprises long, detail-rich prompts averaging 284.89 tokens, whose quality was validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations: state-of-the-art models achieve merely about 50% accuracy in key dimensions such as attribute binding and spatial reasoning, and all models show progressive performance degradation as prompt length increases. Our analysis reveals fundamental limitations in compositional reasoning, demonstrating that current encoders flatten complex grammatical structures and that diffusion models suffer from attribute leakage under detail-intensive conditions. We open-source our dataset, data curation code, and evaluation tools to advance detail-rich T2I generation and enable applications previously hindered by the lack of a dedicated benchmark.
