OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo

TL;DR

OpenLEAF addresses open-domain interleaved image-text generation with a prompting-based pipeline that passes global entity and style contexts to an SDXL-based image synthesis backend. It also introduces a large multimodal model (LMM) evaluation pipeline that uses BingChat to assess entity and style consistency, validated on a 30-query benchmark across diverse tasks. The approach yields coherent interleaved content across domains, and the LMM-based evaluation aligns well with human judgments. The work provides a reproducible baseline, a benchmark, and an evaluation framework for open-domain interleaved generation and its assessment.

Abstract

This work investigates a challenging task named open-domain interleaved image-text generation, which generates interleaved texts and images following an input query. We propose a new interleaved generation framework based on prompting large language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF. In OpenLEAF, the LLM generates textual descriptions, coordinates T2I models, creates visual prompts for generating images, and incorporates global contexts into the T2I models. This global context improves the entity and style consistencies of images in the interleaved generation. For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences. According to the LMM evaluation on our constructed evaluation set, the proposed interleaved generation framework can generate high-quality image-text content for various domains and applications, such as how-to question answering, storytelling, graphical story rewriting, and webpage/poster generation tasks. Moreover, we validate the effectiveness of the proposed LMM evaluation technique with human assessment. We hope our proposed framework, benchmark, and LMM evaluation could help establish the intriguing interleaved image-text generation task.
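
The control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `llm` and `t2i` callables, the prompt wording, the `Segment` type, and the comma-joined prompt format are all assumptions made for illustration, and the stand-in functions at the bottom replace real model calls. The point it shows is that the entity and style contexts are produced once per query and appended to every per-image prompt, which is one plausible way a shared global context could keep images consistent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    text: str           # paragraph shown to the reader
    visual_prompt: str  # full prompt sent to the text-to-image model
    image: str          # placeholder for the generated image (e.g. a file path)


def build_visual_prompt(local_prompt: str, entity_context: str, style_context: str) -> str:
    """Append the shared global context to a per-image prompt (hypothetical format)."""
    parts = [local_prompt.strip(), entity_context.strip(), style_context.strip()]
    return ", ".join(p for p in parts if p)


def generate_interleaved(
    query: str,
    llm: Callable[[str], str],
    t2i: Callable[[str], str],
) -> List[Segment]:
    """Produce alternating text/image segments for one query."""
    # 1. The LLM writes the textual content, one paragraph per line.
    paragraphs = [p for p in llm(f"Write content for: {query}").split("\n") if p.strip()]
    # 2. The global context is generated once and reused for every image.
    entity_context = llm(f"Describe the recurring entities in: {query}")
    style_context = llm(f"Describe one visual style for: {query}")
    segments = []
    for paragraph in paragraphs:
        # 3. The LLM turns each paragraph into a local visual prompt.
        local_prompt = llm(f"Write an image prompt for: {paragraph}")
        prompt = build_visual_prompt(local_prompt, entity_context, style_context)
        # 4. The T2I model renders the image from the combined prompt.
        segments.append(Segment(paragraph, prompt, t2i(prompt)))
    return segments


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for an LLM call (assumption, for demonstration only)."""
    if prompt.startswith("Write content"):
        return "A fox finds a lantern.\nThe fox lights the forest."
    if prompt.startswith("Describe the recurring"):
        return "a small red fox with a white-tipped tail"
    if prompt.startswith("Describe one visual"):
        return "watercolor illustration"
    return "scene: " + prompt.split(": ", 1)[1]


def fake_t2i(prompt: str) -> str:
    """Stand-in for a T2I model; returns a label instead of pixels."""
    return f"<image from {len(prompt)}-char prompt>"


if __name__ == "__main__":
    for seg in generate_interleaved("a fox story", fake_llm, fake_t2i):
        print(seg.text)
        print("  ->", seg.visual_prompt)
```

Swapping `fake_llm` and `fake_t2i` for real model clients would leave `generate_interleaved` unchanged, since it depends only on the two callables.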

Paper Structure (11 sections, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Examples of open-domain interleaved content generation. We show baseline results on producing visual how-to instructions (top-left), generating multi-modal stories (top-right), converting textual stories to multi-modal stories (bottom-left), and generating webpages and posters via HTML and CSS codes (bottom-right).
  • Figure 1: The mean and variance of the BingChat evaluation on the benchmark dataset. Adding global context improves the averaged consistencies and lowers the variances.
  • Figure 2: Overview of (a) the proposed interleaved generation framework and (b) the LMM-based evaluation pipeline.
  • Figure 3: Interleaved visual-language generation results of OpenLEAF on story generation. We visualize the generated interleaved content on the left and the corresponding LMM-Evaluation results on the right. Please zoom in on the screen to see details.
  • Figure 4: Interleaved visual-language generation results of OpenLEAF on webpage generation. We visualize the generated interleaved content on the left and the corresponding LMM-Evaluation results on the right. Please zoom in on the screen to see details.
  • ...and 6 more figures