M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?

Haolong Yan; Kaijun Tan; Yeqing Shen; Xin Huang; Zheng Ge; Xiangyu Zhang; Si Li; Daxin Jiang

TL;DR

This work introduces M-DocSum-Bench, a challenging multimodal benchmark for interleaved image-text summarization built from 500 arXiv papers, paired with an automated, human-validated evaluation framework (M-DocEval). It shows that leading LVLMs struggle to maintain coherence and to integrate long-range interleaved content, and it proposes M-DocSum-7B, a robust 7B baseline trained in two stages (instruction tuning and Direct Preference Optimization) that achieves state-of-the-art results among open and many closed models. The benchmark and evaluation framework highlight key challenges in image referencing, cross-modal alignment, and long-context understanding. Overall, the work delivers a reproducible protocol, empirical insights, and a practical baseline for studying interleaved image-text comprehension in long scientific documents.

Abstract

We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in documents? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and make it difficult to guarantee coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task that requires generating interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. We further develop a robust summarization baseline, M-DocSum-7B, through progressive two-stage training with diverse instruction and preference data. Extensive results on M-DocSum-Bench reveal that leading LVLMs struggle to maintain coherence and accurately integrate information within long, interleaved contexts, often confusing similar images and lacking robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet, and Qwen2.5-VL-72B), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.
