OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang

TL;DR

This work introduces OpenING, a comprehensive benchmark for open-ended interleaved image-text generation. It comprises 5,400 annotated instances across 56 tasks and 23 meta-topics chosen to reflect real-world scenarios. The work also introduces IntJudge, an offline judge model trained with a Reference-Augmented Generation (RAG) data pipeline and evaluated in an Interleaved Arena of pairwise comparisons across seven criteria. IntJudge reaches 82.42% agreement with human judgments and outperforms GPT-based evaluators. Experiments on OpenING show that current interleaved generation methods still struggle with coherence and quality, and that integrated pipelines (e.g., GPT-4o+DALL-E-3, Gemini+Flux) generally outperform end-to-end and two-stage architectures. The work highlights the value of large-scale, diverse interleaved data and a scalable evaluation framework. It also notes limitations in data diversity, multilingual coverage, and potential evaluator biases, and suggests directions for future multimodal evaluation research and RL-based improvements.

Abstract

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge because it requires integrated multimodal understanding and generation abilities. While progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to limitations in data size and diversity. To bridge this gap, we introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models.
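
To make the two quantities mentioned above concrete, here is a minimal sketch of how an agreement rate between a judge and human annotators, and a per-model win rate, might be computed from pairwise comparisons. It is an illustration under stated assumptions, not the paper's implementation: the function names `agreement_rate` and `win_rates`, the verdict labels `"A"`, `"B"`, and `"tie"`, and the choice to count a tie as a win for neither side are all hypothetical.

```python
from collections import defaultdict


def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of pairwise comparisons where the judge matches the human.

    Both arguments are equal-length sequences of verdict labels
    (assumed here to be "A", "B", or "tie").
    """
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not judge_verdicts:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


def win_rates(comparisons):
    """Per-model win rate from (model_a, model_b, verdict) triples.

    Assumption: a "tie" counts as a comparison for both models
    but as a win for neither.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for model_a, model_b, verdict in comparisons:
        totals[model_a] += 1
        totals[model_b] += 1
        if verdict == "A":
            wins[model_a] += 1
        elif verdict == "B":
            wins[model_b] += 1
    return {model: wins[model] / totals[model] for model in totals}
```

For example, if a judge matches the human verdict on three of four comparisons, `agreement_rate` returns 0.75. Other ways of handling ties (for instance, dropping them or counting them as half a win) would give different numbers, so the tie policy should be stated alongside any reported rate.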

Paper Structure

This paper contains 42 sections, 12 equations, 36 figures, 10 tables.

Figures (36)

  • Figure 1: Motivation: (a) Rapid progress of interleaved image-text generation. (b) Interleaved content is essential to provide key information for complex real-world tasks (e.g., product design).
  • Figure 2: OpenING benchmark consists of 23 meta-topics (inner ring) which are further categorized into 56 specific tasks (see the number of tasks on the outer ring and details in Supplementary Materials). Examples showcase interleaved generation in eight representative domains.
  • Figure 3: Overview of data curation and the proposed judge pipeline. (a) We construct our OpenING benchmark in a top-down manner, which involves five stages: conceptualization, data collection, annotation, filtering and processing. (b) We use the Dev Set of OpenING to train the proposed IntJudge and evaluate interleaved image-text generation on the Test Set to compare our IntJudge with human and GPT-4o.
  • Figure 4: Model win rates under image-only and text-only settings across different models, ranked by human judgments.
  • Figure 5: Win rate matrix of human and ten MLLM models, evaluated by human, GPT-4o, and our IntJudge, respectively.
  • ...and 31 more figures