CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Wei Chen, Lin Li, Yongqi Yang, Bin Wen, Fan Yang, Tingting Gao, Yu Wu, Long Chen

TL;DR

The paper tackles the challenge of generating coherent interleaved image-text content by identifying data quality as a key bottleneck. It introduces CoMM, a high-quality dataset built from instructional and visual-story sources and filtered with multi-perspective LLM/CLIP pipelines to enforce narrative coherence, image consistency, and strong image-text alignment, plus a preference dataset for reinforcement learning. Four benchmark tasks and a comprehensive evaluation framework are proposed, and the dataset’s effectiveness is demonstrated through improved few-shot in-context multimodal understanding and generation across multiple tasks. This work emphasizes a data-centric path to advancing multimodal large language models and enabling more reliable multimodal in-context learning and generation.

Abstract

Interleaved image-text generation has emerged as a crucial multimodal task, aiming at creating sequences of interleaved visual and textual content given a query. Despite notable advancements in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. Initially, CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling, establishing a foundation for coherent and consistent content. To further refine the data quality, we devise a multi-perspective filter strategy that leverages advanced pre-trained models to ensure the development of sentences, consistency of inserted images, and semantic alignment between them. Various quality evaluation metrics are designed to prove the high quality of the filtered dataset. Meanwhile, extensive few-shot experiments on various downstream tasks demonstrate CoMM's effectiveness in significantly enhancing the in-context learning capabilities of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved generation abilities, supported by a comprehensive evaluation framework. We believe CoMM opens a new avenue for advanced MLLMs with superior multimodal in-context learning and understanding ability.

Paper Structure (21 sections, 5 equations, 10 figures, 13 tables)

This paper contains 21 sections, 5 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Illustration of interleaved image-text content generation results and dataset quality. (a) Given a query from CoMM, interleaved image-text content generated by Emu2 [sun2023generative] trained separately on MMC4 [zhu2024multimodal] and on CoMM (Ours). (b) As in (a), but with a query from MMC4. (c) A training sample from MMC4. (d) A training sample from CoMM (Ours).
  • Figure 2: Distribution of the number of images and sentences per document for three datasets. $\mu$ and $M$ denote the mean and median number of images or sentences per document, respectively.
  • Figure 3: Visualization of interleaved image-text content generated by SEED-Llama [ge2023making] (top) and MiniGPT-5 [zheng2023minigpt] (bottom), each trained separately on MMC4 [zhu2024multimodal] and on CoMM (Ours).
  • Figure 4: Comparison of samples from different datasets. (a) A sample from the MMC4 [zhu2024multimodal] dataset; (b)-(f) samples from different data sources within the CoMM (Ours) dataset.
  • Figure 5: Topic visualization of our dataset. 'Others' contains 'Exercise', 'Drawing & Design', 'Boating', etc., totaling 144 topics.
  • ...and 5 more figures