LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations

Zhichao Yang, Tianjiao Gu, Jianjie Wang, Feiyu Lin, Xiangfei Sheng, Pengfei Chen, Leida Li

TL;DR

This work tackles the challenge of evaluating long text-to-image (T2I) alignment by introducing LongT2IBench, a benchmark of 14K long text-image pairs with graph-structured annotations produced through a Generate-Refine-Qualify protocol. It enables fine-grained, interpretable alignment assessment by converting prompts into textual graphs of entities, attributes, and relations, and by producing both alignment scores and structured interpretations. Building on this dataset, LongT2IExpert is proposed as a multimodal evaluator that uses a Hierarchical Alignment Chain-of-Thought to produce quantitative scores and JSON-based interpretations, trained with LoRA in a multi-task setup. Experimental results show that LongT2IExpert outperforms existing evaluators across varying prompt lengths and provides more reliable alignment interpretations, a step toward automatic, interpretable evaluation of long-prompt T2I generation.

Abstract

The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate image-text alignment in long-prompt scenarios. However, existing T2I alignment benchmarks predominantly focus on short-prompt scenarios and only provide MOS or Likert-scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structured annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a LongT2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation. Data and code are available at https://welldky.github.io/LongT2IBench-Homepage/.
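To make the idea of a textual graph more concrete, the following is a minimal Python sketch of how a prompt's entities, attributes, and relations might be stored together with per-element alignment judgments, and how a score and an interpretation could be derived from them. Everything here is an illustrative assumption: the class names `Element` and `TextualGraph`, the example prompt, and in particular the scoring rule (the fraction of aligned elements) are hypothetical and are not taken from the paper, whose actual label-generation formulas are not reproduced in this summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Element:
    """One node or edge of a textual graph (hypothetical schema)."""
    kind: str      # "entity", "attribute", or "relation"
    text: str      # human-readable description of the element
    aligned: bool  # annotator judgment: does the image depict it?


@dataclass
class TextualGraph:
    elements: List[Element] = field(default_factory=list)

    def score(self) -> float:
        # ASSUMPTION: score = fraction of aligned elements.
        # The paper's own scoring formula may differ.
        if not self.elements:
            return 0.0
        return sum(e.aligned for e in self.elements) / len(self.elements)

    def interpretation(self) -> Dict[str, List[str]]:
        # Group the misaligned elements by kind, as a simple
        # structured (JSON-serializable) explanation.
        report: Dict[str, List[str]] = {
            "entity": [], "attribute": [], "relation": []
        }
        for e in self.elements:
            if not e.aligned:
                report[e.kind].append(e.text)
        return report


# Hypothetical prompt: "A red dog sits beside a blue bench."
# Suppose the generated image shows a green bench instead.
graph = TextualGraph([
    Element("entity", "dog", True),
    Element("entity", "bench", True),
    Element("attribute", "dog: red", True),
    Element("attribute", "bench: blue", False),
    Element("relation", "dog sits beside bench", True),
])

print(graph.score())
print(graph.interpretation())
```

In this toy case four of the five elements are aligned, and the interpretation singles out the misaligned attribute. A real annotation would cover far more elements per prompt, which is the motivation for decomposing long prompts into graphs in the first place.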

Paper Structure

This paper contains 15 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustration of the graph-structured annotations of LongT2IBench. Compared to existing benchmarks, we focus on T2I alignment in long prompt scenarios, offering both quantitative scores and fine-grained interpretations.
  • Figure 2: Overview of the construction pipeline for LongT2IBench. The pipeline consists of three stages: (a) Data Preparation. Long prompts are collected from three sources and input into various T2I models to generate images. (b) Data Annotation. Long prompts are converted into textual graph structures, and fine-grained image-textual graph alignment annotations are achieved. (c) Label Generation. Two categories of labels, quantitative alignment scores and alignment interpretations, are produced from the graph-structured human annotations.
  • Figure 3: The Source Distribution Map of LongPrompt-3K. The long prompts are sampled from three sources: Human-Gen, AI-Gen and Img-Cap. The prompts are evenly distributed across word-count ranges, and the numbers of entities, attributes, and relations in each range are also reported.
  • Figure 4: Statistical Analysis. (a) Average alignment scores across five word-count intervals. (b) Distribution of annotated alignment scores for six T2I generative models. (c) Alignment and misalignment rates across entities, attributes, and relations. (d) Alignment percentages among six relation categories (Action, Connection, Description, Possession, From/to and Spatial relation).
  • Figure 5: Overall pipeline of the proposed LongT2IExpert. A Hierarchical Alignment Chain-of-Thought (CoT) is designed to instruct MLLMs for structured alignment reasoning. Numerical alignment scores and graph-structured interpretations are used to train MLLMs for alignment scoring and interpretation in a multi-task manner.
  • ...and 1 more figures