RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering

Rongyang Zhang, Yuqing Huang, Chengqiang Lu, Qimeng Wang, Yan Gao, Yi Wu, Yao Hu, Yin Xu, Wei Wang, Hao Wang, Enhong Chen

TL;DR

RAG-IGBench is a benchmark for evaluating interleaved image-text generation in open-domain QA, where multimodal LLMs (MLLMs) generate answers using Retrieval-Augmented Generation. It introduces a three-dimensional evaluation framework (text quality, image quality, image-text coherence) and a 6,057-sample dataset derived from Xiaohongshu, with novel metrics validated against human judgments. The study shows that fine-tuning on the benchmark improves multimodal performance and that the proposed metrics align with human judgments. Experiments with open-source and proprietary MLLMs reveal their current strengths and limits in handling complex interleaved multimodal content and point to directions for future improvement.

Abstract

In real-world scenarios, answering user queries with visually enhanced responses can considerably benefit understanding and memory, underscoring the value of interleaved image-text generation. Despite recent progress, such as visual autoregressive models that unify text and image processing in a single transformer architecture, generating high-quality interleaved content remains challenging. Moreover, the evaluation of these interleaved sequences remains largely underexplored, with existing benchmarks often limited by unimodal metrics that inadequately assess the intricacies of combined image-text outputs. To address these issues, we present RAG-IGBench, a thorough benchmark designed specifically to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content. Distinct from previous datasets, RAG-IGBench draws on the latest publicly available content from social platforms and introduces innovative evaluation metrics that measure the quality of text and images, as well as their consistency. Through extensive experiments with state-of-the-art MLLMs (both open-source and proprietary) on RAG-IGBench, we provide an in-depth analysis of the capabilities and limitations of these models. Additionally, we validate our evaluation metrics by demonstrating their high correlation with human assessments. Models fine-tuned on RAG-IGBench's training set exhibit improved performance across multiple benchmarks, confirming both the quality and practical utility of our dataset. Our benchmark is available at https://github.com/USTC-StarTeam/RAG-IGBench.

Paper Structure

This paper contains 24 sections, 3 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: Data example and statistics of input.
  • Figure 2: Overview of the pipeline of RAG-IG and the data construction of RAG-IGBench.
  • Figure 3: Ablation experiment result of different resolutions and amounts of input images.
  • Figure 4: The distribution of distinct types of queries in RAG-IGBench.
  • Figure 5: Comparison of results generated by different models.
  • ...and 1 more figure