Holistic Evaluation for Interleaved Text-and-Image Generation

Minqian Liu, Zhiyang Xu, Zihao Lin, Trevor Ashby, Joy Rimchala, Jiaxin Zhang, Lifu Huang

TL;DR

InterleavedBench and InterleavedEval address a gap in evaluating interleaved text-and-image generation by providing a diverse, instruction-rich benchmark and a reference-free, multi-aspect evaluation metric based on GPT-4o. The dataset spans 10 real-world use cases with context-based and context-free subsets, totaling 815 instances, and emphasizes instruction-following and cross-modal coherence. Experimental results show that pipeline approaches (LLMs plus image generators) outperform integrated multimodal models, and that InterleavedEval agrees with human judgments more closely than prior metrics. The results also expose remaining challenges, especially for image coherence. The work offers a practical framework and resources to guide future development of interleaved generation and its evaluation, with implications for real-world multimodal content creation and storytelling.

Abstract

Interleaved text-and-image generation has been an intriguing research direction, in which models are required to generate both images and text pieces in an arbitrary order. Despite emerging advances in interleaved generation, progress in its evaluation still lags significantly behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they cover only a limited number of domains and use cases. In addition, current work predominantly uses similarity-based metrics, which fall short in assessing quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks covering diverse real-world use cases. We also present InterleavedEval, a strong reference-free metric powered by GPT-4o that delivers accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval (text quality, perceptual quality, image coherence, text-image coherence, and helpfulness) to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate existing models, with a strong correlation with human judgments that surpasses previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.
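
As a rough illustration of what a five-aspect, per-instance evaluation record might look like, the sketch below stores one score per aspect named in the abstract. Only the five aspect names come from the source. The class name `AspectScores`, the helper `weakest_aspect`, the numeric values, and the unweighted mean are hypothetical choices made for illustration and are not the authors' implementation or aggregation rule.

```python
from dataclasses import dataclass, fields
from statistics import mean
from typing import Dict


@dataclass(frozen=True)
class AspectScores:
    """One score per evaluation aspect named in the abstract.

    The numeric scale is an assumption; the source does not state it here.
    """
    text_quality: float
    perceptual_quality: float
    image_coherence: float
    text_image_coherence: float
    helpfulness: float

    def as_dict(self) -> Dict[str, float]:
        # Map each aspect name to its score, in declaration order.
        return {f.name: getattr(self, f.name) for f in fields(self)}

    def average(self) -> float:
        # Unweighted mean across the five aspects (an illustrative choice,
        # not a documented aggregation rule).
        return mean(self.as_dict().values())


def weakest_aspect(scores: AspectScores) -> str:
    """Return the name of the lowest-scoring aspect."""
    by_aspect = scores.as_dict()
    return min(by_aspect, key=by_aspect.get)


# Hypothetical scores for a single generated output.
example = AspectScores(
    text_quality=4.0,
    perceptual_quality=3.0,
    image_coherence=2.0,
    text_image_coherence=3.0,
    helpfulness=3.0,
)
print(example.average())
print(weakest_aspect(example))
```

Keeping the aspects as separate fields, rather than collapsing them into one number up front, preserves the fine-grained view the abstract emphasizes: a single weak aspect stays visible even when the average looks acceptable.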

Paper Structure (37 sections, 6 figures, 10 tables)


Figures (6)

  • Figure 1: Overview of our InterleavedBench, a comprehensive benchmark that covers 10 diverse use cases for interleaved text-and-image generation, and the evaluation results of InterleavedEval based on GPT-4o.
  • Figure 2: Comparison between the existing benchmark (multi-concept image composition; Kumari et al., 2022) and our InterleavedBench. Compared with the existing benchmark, InterleavedBench has the following features: (1) both input and output can have arbitrarily interleaved text and images, and (2) each instance has a detailed instruction to benchmark models' instruction-following capability.
  • Figure 3: Illustration of examples in our InterleavedBench from six representative use cases.
  • Figure 4: The distribution of the use cases in InterleavedBench.
  • Figure 5: Case study. We select representative examples of the system outputs from GILL, EMU-2, Gemini+SDXL, and GPT-4+DALLE3.
  • ...and 1 more figure