M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?

Haolong Yan, Kaijun Tan, Yeqing Shen, Xin Huang, Zheng Ge, Xiangyu Zhang, Si Li, Daxin Jiang

TL;DR

This work introduces M-DocSum-Bench, a challenging multimodal benchmark that requires interleaved image-text document summarization from 500 arXiv papers, paired with an automated, human-validated evaluation framework (M-DocEval). It demonstrates that leading LVLMs struggle to maintain coherence and integrate long-range interleaved content, and proposes M-DocSum-7B, a robust 7B baseline trained in two stages (instruction-tuning and Direct Preference Optimization) that achieves state-of-the-art results among open and many closed models. The benchmark and evaluation framework highlight key challenges in image referencing, cross-modal alignment, and long-context understanding, providing a path toward more capable interleaved multimodal document understanding. Overall, the work delivers a reproducible protocol, strong empirical insights, and a practical baseline that advances the study of interleaved image-text comprehension in long scientific documents.

Abstract

We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in a document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and make it difficult to guarantee coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task that requires generating interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. We further develop a robust summarization baseline, M-DocSum-7B, through progressive two-stage training with diverse instruction and preference data. Extensive results on M-DocSum-Bench reveal that leading LVLMs struggle to maintain coherence and accurately integrate information within long, interleaved contexts, often confusing similar images and lacking robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet, and Qwen2.5-VL-72B), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.

Paper Structure

This paper contains 30 sections, 5 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Illustration of the Interleaved Multimodal Summarization task. The document is an example from M-DocSum-Bench, which has 24 pages and a total of 14 images. The Interleaved Summarization is generated by our M-DocSum-7B model.
  • Figure 2: The quantitative indicators of M-DocSum-Bench display fundamental information such as token length, image count, and document topics.
  • Figure 3: Illustration of the automated data construction pipeline, multi-roll data verification, and two-stage training.
  • Figure 4: Quantitative analysis of the performance trends of different models as data characteristics vary, including the different paragraphs in the interleaved summarization, the token length of the original text, and the number of input images.
  • Figure 5: Blue bars represent original image scores, orange bars represent scores after modification, and the green line indicates the decline rate.
  • ...and 8 more figures