A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang

TL;DR

This work proposes SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable, complementary scores. Together they cover content, quality, and cross-modal interaction, forming a comprehensive evaluation framework.

Abstract

Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a dataset that features: (1) large scale, comprising 1.8M multimodal samples; (2) high quality, supported by our proposed Self-Evaluation with Iterative Refinement (SEIR) method for rigorous automated quality refinement; (3) rich instructional diversity, ensured through diverse, well-designed question templates that are based on human preferences and cover a 3500-topic hierarchy. These characteristics make InterSyn particularly well-suited for training LMMs in interleaved image-text generation. To evaluate these capabilities, we propose SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness (TCC), Image Content Completeness (ICC), Image Quality (IQ), and Image-Text Synergy (ITS). These scores are complementary, covering content, quality, and cross-modal interaction, and together form a comprehensive evaluation framework. Experimental results on InterSyn subsets of up to 200K samples show that 25K-50K samples already yield substantial improvements, while scaling to 100K/200K brings further gains in TCC, ICC, and especially ITS. This highlights InterSyn's (1) scalability, as performance consistently improves with more data, and (2) efficiency, as significant gains are achievable even with smaller subsets, making it accessible to researchers with varying computational resources.

Paper Structure

This paper contains 60 sections, 8 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: InterSyn: Topic hierarchy and interleaved question answering samples
  • Figure 2: Overview of the InterSyn Dataset Construction Framework. The top panel illustrates the dataset preparatory work, covering question collection, filtering, template standardization, and topic expansion. The bottom panel illustrates the Self-Evaluation with Iterative Refinement (SEIR) method, which employs a Generate-Evaluate-Refine loop across three cascaded stages. (1) Question Refinement (QR): An initial question $q_0$ is refined into the final question $q$ based on the topic $z$ and question template. (2) Answer Refinement (AR): Using $q$, the final answer $a$ and a temporary caption $\gamma$ are iteratively refined. (3) Image Refinement (IR): Initialize $c_0$ with $\gamma$. Refine the caption and image until the final image $I$ is produced. The right-side legend details the inputs and historical context ($\mathcal{H}^{(t-1)}$) used at each stage.
  • Figure 3: Impact of question refinement (QR) on question quality. This plot shows the quality scores across different QR iterations. Quality improves significantly over the first three iterations but plateaus thereafter.
  • Figure 4: Visualization of RMSE for different judges.
  • Figure 5: Examples of single-turn conversation
  • ...and 3 more figures
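
The Generate-Evaluate-Refine loop described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `refine_loop`, the `evaluate` and `refine` callables, the `max_iters` cap, and the toy question-mark check are all assumptions made for illustration. In the paper's setting each stage (QR, AR, IR) would presumably call a model to evaluate and refine; here plain functions stand in for those calls, and the `history` list plays the role of the historical context $\mathcal{H}^{(t-1)}$.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for this sketch only.
History = List[Tuple[str, str]]                       # (candidate, feedback) pairs
Evaluate = Callable[[str, History], Tuple[bool, str]]  # returns (accepted, feedback)
Refine = Callable[[str, str, History], str]            # returns a new candidate


def refine_loop(initial: str, evaluate: Evaluate, refine: Refine,
                max_iters: int = 3) -> Tuple[str, History]:
    """Generic generate-evaluate-refine loop (illustrative assumption).

    At each step the current candidate is evaluated against the history of
    earlier steps. If it is accepted the loop stops; otherwise it is refined
    using the feedback. After max_iters rejections the last refined
    candidate is returned without a further evaluation.
    """
    history: History = []
    candidate = initial
    for _ in range(max_iters):
        accepted, feedback = evaluate(candidate, history)
        if accepted:
            break
        refined = refine(candidate, feedback, history)
        history.append((candidate, feedback))  # record the rejected step
        candidate = refined
    return candidate, history


# Toy stand-ins for a question-refinement stage (assumptions, not real models).
def toy_evaluate(question: str, history: History) -> Tuple[bool, str]:
    if question.endswith("?"):
        return True, ""
    return False, "missing question mark"


def toy_refine(question: str, feedback: str, history: History) -> str:
    return question + "?"


final, history = refine_loop("what is a fjord", toy_evaluate, toy_refine)
print(final)
print(history)
```

Under these assumptions, the three cascaded stages would chain such loops: the refined question feeds the answer stage, and the temporary caption from the answer stage seeds the image stage.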