TIIF-Bench: How Does Your T2I Model Follow Your Instructions?

Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, Lei Zhang

TL;DR

TIIF-Bench introduces a large, hierarchically structured benchmark to rigorously assess T2I models' instruction-following capabilities. It addresses the flaws of prior benchmarks with 5000 prompts spanning 36 attribute combinations and three difficulty levels, adds new dimensions for text rendering and style control, and includes length-augmented prompts and designer-level prompts. A novel evaluation protocol leverages the world knowledge of vision-language models to make attribute-specific yes/no judgments and uses GNED to quantify text rendering fidelity. Empirical results reveal architecture-specific strengths and trade-offs across closed-source, open-source, diffusion, and autoregressive models, show strong correlation with human judgments, and demonstrate that instruction understanding is closely tied to generation quality. The work provides practical guidance for developing more instruction-adaptive T2I systems and highlights directions for expanding benchmarks beyond English, common objects, and basic stylistic cues.

Abstract

The rapid advancement of Text-to-Image (T2I) models has ushered in a new phase of AI-generated content, marked by their growing ability to interpret and follow user instructions. However, existing T2I evaluation benchmarks fall short: their prompts are limited in diversity and complexity, and their evaluation metrics are coarse, making it difficult to evaluate fine-grained alignment between textual instructions and generated images. In this paper, we present TIIF-Bench (Text-to-Image Instruction Following Benchmark), which aims to systematically assess T2I models' ability to interpret and follow intricate textual instructions. TIIF-Bench comprises a set of 5000 prompts organized along multiple dimensions and categorized into three levels of difficulty and complexity. To rigorously evaluate model robustness to varying prompt lengths, we provide a short and a long version of each prompt with identical core semantics. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models. In addition, we collect 100 high-quality designer-level prompts that encompass various scenarios to comprehensively assess model performance. Leveraging the world knowledge encoded in large vision-language models, we propose a novel computable framework to discern subtle variations in T2I model outputs. Through meticulous benchmarking of mainstream T2I models on TIIF-Bench, we analyze the pros and cons of current T2I models and reveal the limitations of current T2I benchmarks. Project Page: https://a113n-w3i.github.io/TIIF_Bench/.

Paper Structure

This paper contains 27 sections, 2 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Prompt diversity/complexity of TIIF-Bench compared to prior benchmarks. (Left) Semantic uniqueness after de-duplication using a cosine similarity threshold of 0.85: fewer than 30% and 60% of the prompts in CompBench++ and GenAI Bench, respectively, are unique, while more than 90% of the prompts in TIIF-Bench are unique. (Middle) t‑SNE visualization of CLIP text embeddings shows that TIIF‑Bench spans a much broader semantic space than existing benchmarks. (Right) Prompt lengths in CompBench++ and GenAI Bench are fixed and short, while TIIF-Bench covers a much wider range. As shown in the lower-left corner, the longest prompts in CompBench++ and GenAI Bench are shorter than 50 words, while TIIF-Bench contains complex prompts exceeding 500 words.
  • Figure 2: Prompt‑length sensitivity on the Numeracy attribute (“more birds than fish”) from TIIF-Bench. Short prompt: “The birds are more numerous than the fish.” Long prompt: “The birds, with their feathers catching the gentle light of dawn, vastly outnumber their aquatic counterparts, the fish, which glide silently beneath the rippling surface of the water, their sleek forms moving like shadows in the depths below.” PixArt-Sigma and SD 3.5 are noticeably sensitive to prompt length: their ability to follow the core semantic instruction varies between the short and long versions of the same prompt. In contrast, DALL·E 3 and GPT-4o are robust, maintaining consistent instruction-following performance regardless of prompt length.
  • Figure 3: Illustration of the limitations of expert scorers widely used in T2I benchmarks such as CompBench++ and GenAI Bench. CLIP and BLIP can only assess image-level alignment and they struggle with evaluating the fine-grained instruction-following capability of T2I models, often producing scores misaligned with human judgments. UniDet fails in most cases, unable to detect objects in complex scenes produced by modern T2I models. VQAScore, a variant of CLIP fine-tuned for image quality assessment, remains limited in addressing the complexity of modern generations. As a result, the best models selected by these expert scorers are different from the one selected via human subjective evaluation (i.e., GPT-4o).
  • Figure 4: Failure cases of current T2I evaluation methods under coarse prompt queries. Since the image caption is used as part of the prompt, these methods can easily induce hallucinations of VLMs such as GPT-4o and consequently lead to highly optimistic scores.
  • Figure 5: (Top) Prompt-construction and evaluation pipeline of TIIF-Bench. The example depicts a “Color + 3D Perspective” pairing, one of the 36 distinct attribute-pool combinations defined in our framework. (Bottom) In addition to systematically reusing prompts from existing benchmarks, we gather text-generation prompts from Lex-Art (Zhao et al., 2025) and curate style control and real-world designer-level prompts. All prompts are further expanded by GPT-4o to produce long-form counterparts.
  • ...and 7 more figures
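
The semantic-uniqueness measurement in the Figure 1 caption (de-duplication at a cosine similarity threshold of 0.85) can be illustrated with a minimal sketch. It assumes prompt embeddings are already available as vectors and uses a greedy keep-first strategy. The function names `greedy_dedup` and `unique_ratio`, the toy vectors, and the greedy ordering are illustrative assumptions; the summary does not state how the authors computed embeddings for this panel or which de-duplication procedure they used.

```python
import numpy as np


def greedy_dedup(embeddings, threshold=0.85):
    """Return indices of embeddings kept after greedy de-duplication.

    An item is dropped when its cosine similarity to any previously kept
    item is at or above `threshold`. The keep-first ordering is an
    assumption made for this sketch, not the paper's stated procedure.
    """
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # unit vectors; guard against zero norm
    kept = []
    for i in range(len(unit)):
        # Dot product of unit vectors equals cosine similarity.
        if not kept or np.max(unit[kept] @ unit[i]) < threshold:
            kept.append(i)
    return kept


def unique_ratio(embeddings, threshold=0.85):
    """Fraction of items that survive de-duplication."""
    n = len(embeddings)
    return len(greedy_dedup(embeddings, threshold)) / n if n else 0.0


# Toy 2-D "embeddings": two near-duplicate pairs (hypothetical data).
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
kept_indices = greedy_dedup(toy)
ratio = unique_ratio(toy)
```

In this toy input, the second and fourth vectors are near-duplicates of the first and third, so half of the items survive. On real prompt sets the result depends on the embedding model and on the order in which prompts are visited.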