DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li

TL;DR

DetailMaster is the first comprehensive benchmark specifically designed to evaluate T2I models' systematic ability to handle extended textual inputs with complex compositional requirements; its evaluation reveals fundamental limitations in compositional reasoning.

Abstract

While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance degrades significantly when confronted with the long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models' systematic abilities to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. The benchmark comprises long, detail-rich prompts averaging 284.89 tokens, with quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations: state-of-the-art models achieve only about 50% accuracy in key dimensions such as attribute binding and spatial reasoning, and all models show progressive performance degradation as prompt length increases. Our analysis reveals fundamental limitations in compositional reasoning, demonstrating that current encoders flatten complex grammatical structures and that diffusion models suffer from attribute leakage under detail-intensive conditions. We open-source our dataset, data curation code, and evaluation tools to advance detail-rich T2I generation and enable applications previously hindered by the lack of a dedicated benchmark.

Paper Structure

This paper contains 48 sections, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Text-to-image errors in a long-prompt scenario (with FLUX.1-dev). Real source image (left), detailed caption/prompt (middle), and generated image (right). Red text indicates failure points.
  • Figure 2: Overview diagram of the data construction process for the DetailMaster benchmark.
  • Figure 3: Negative correlation between generation accuracy and prompt token length.
  • Figure 4: Visualization of the nine-grid layout: (a) shows the upper and lower parts; (b) shows the left and right parts; (c) shows the middle part; and (d) shows the four corner regions: the upper left, lower left, upper right, and lower right parts.
  • Figure 5: Overview diagram of the evaluation pipeline for the DetailMaster benchmark.
  • ...and 3 more figures
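
The nine-grid regions named in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' code: the function name `grid_regions`, the use of normalized coordinates with the origin at the top-left, and the reading of each "part" as a union of cells in an even 3×3 grid are all assumptions made for illustration. The paper may define the regions differently.

```python
from typing import Set


def grid_regions(cx: float, cy: float) -> Set[str]:
    """Return the region labels that contain the normalized point (cx, cy).

    Assumptions (not taken from the paper): coordinates lie in [0, 1] with
    the origin at the top-left, and the image is split into an even 3x3 grid.
    """
    if not (0.0 <= cx <= 1.0 and 0.0 <= cy <= 1.0):
        raise ValueError("cx and cy must lie in [0, 1]")

    # Map each coordinate to a cell index 0, 1, or 2; clamp 1.0 into the last cell.
    col = min(int(cx * 3), 2)  # 0 = left column, 1 = center column, 2 = right column
    row = min(int(cy * 3), 2)  # 0 = top row, 1 = middle row, 2 = bottom row

    regions: Set[str] = set()
    if row == 0:
        regions.add("upper")
    if row == 2:
        regions.add("lower")
    if col == 0:
        regions.add("left")
    if col == 2:
        regions.add("right")
    if row == 1 and col == 1:
        regions.add("middle")
    # Corner cells belong to both a vertical and a horizontal band.
    if row != 1 and col != 1:
        vertical = "upper" if row == 0 else "lower"
        horizontal = "left" if col == 0 else "right"
        regions.add(f"{vertical} {horizontal}")
    return regions


if __name__ == "__main__":
    # Center of a hypothetical character bounding box near the top-left corner.
    print(sorted(grid_regions(0.1, 0.1)))
```

Under these assumptions a single point can fall into several overlapping regions: a corner cell is also part of one vertical band and one horizontal band.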