MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation

Yongyue Zhang; Yaxiong Wu

TL;DR

MLDocRAG tackles multimodal long-context document QA with a unified, query-centric retrieval framework. It builds a Multimodal Chunk-Query Graph (MCQG) via MDoc2Query, linking heterogeneous chunks to semantically rich queries so that evidence can be aggregated at a fine grain across modalities and pages. The framework works in two stages, MCQG construction and MCQG usage, and adds non-parametric and parametric optimizations. On MMLongBench-Doc and LongDocURL it improves retrieval quality and answer accuracy and shows robust grounding and multi-hop reasoning. By combining document expansion with graph-based retrieval, the approach supports practical multimodal long-context understanding with scalable, modality-aware storage and retrieval.

Abstract

Understanding multimodal long-context documents, which comprise multimodal chunks such as paragraphs, figures, and tables, is challenging for two reasons: (1) cross-modal heterogeneity, which makes it hard to localize relevant information across modalities, and (2) cross-page reasoning, which requires aggregating evidence dispersed across pages. To address these challenges, we adopt a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, with queries acting as abstract semantic surrogates for heterogeneous multimodal content. In this paper, we propose a Multimodal Long-Context Document Retrieval Augmented Generation (MLDocRAG) framework that leverages a Multimodal Chunk-Query Graph (MCQG) to organize multimodal document content around semantically rich, answerable queries. MCQG is constructed via a multimodal document expansion process that generates fine-grained queries from heterogeneous document chunks and links them to their corresponding content across modalities and pages. This graph-based structure enables selective, query-centric retrieval and structured evidence aggregation, thereby enhancing grounding and coherence in long-context multimodal question answering. Experiments on the MMLongBench-Doc and LongDocURL datasets show that MLDocRAG consistently improves retrieval quality and answer accuracy, demonstrating its effectiveness for long-context multimodal understanding.
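
The chunk-query linking described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Chunk`, `ChunkQueryGraph`, and `retrieve` are hypothetical, the queries are written by hand (in the paper they come from the MDoc2Query expansion step), and token-overlap (Jaccard) similarity stands in for whatever retriever the paper actually uses. It only shows how matching a user question against generated queries can surface chunks from different modalities and pages.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record; the field names are assumptions."""
    chunk_id: str
    page: int
    modality: str  # e.g. "text", "figure", "table"


def tokenize(text):
    # Crude lowercase whitespace tokenizer, used only for this toy example.
    return set(text.lower().replace("?", " ").split())


def jaccard(a, b):
    # Token-overlap similarity; a stand-in for a real embedding-based retriever.
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class ChunkQueryGraph:
    """Bipartite graph: generated queries on one side, chunks on the other."""

    def __init__(self):
        self.chunks = {}
        self.query_to_chunks = defaultdict(set)

    def add_chunk(self, chunk, generated_queries):
        # Link every query generated for this chunk back to the chunk.
        self.chunks[chunk.chunk_id] = chunk
        for query in generated_queries:
            self.query_to_chunks[query].add(chunk.chunk_id)

    def retrieve(self, user_query, top_k_queries=2):
        # Rank stored queries against the user query, then collect linked chunks.
        ranked = sorted(
            self.query_to_chunks,
            key=lambda q: jaccard(user_query, q),
            reverse=True,
        )
        hit_ids = set()
        for query in ranked[:top_k_queries]:
            hit_ids |= self.query_to_chunks[query]
        return sorted(
            (self.chunks[i] for i in hit_ids),
            key=lambda c: (c.page, c.chunk_id),
        )


# Hand-written queries standing in for model-generated ones.
graph = ChunkQueryGraph()
graph.add_chunk(Chunk("p2-text", 2, "text"), ["what was the revenue in 2023"])
graph.add_chunk(Chunk("p7-table", 7, "table"), ["what was the revenue in 2023 by region"])
graph.add_chunk(Chunk("p9-figure", 9, "figure"), ["how many employees joined"])

hits = graph.retrieve("What was the revenue in 2023?")
print([(c.chunk_id, c.modality) for c in hits])
```

In this toy run the question matches two stored queries, so a paragraph on page 2 and a table on page 7 are returned together, while the unrelated figure is left out.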

Paper Structure (49 sections, 11 equations, 7 figures, 2 tables)

Introduction
Related Work
Long-Context Document Understanding
Multimodal RAG
Document Expansion
Methodology
Preliminaries
Framework Overview
MCQG Construction.
MCQG Usage.
MCQG Construction
(1) Document Parsing.
(2) MDoc2Query.
(3) Graph Assembly.
(4) Storage.
...and 34 more sections

Figures (7)

Figure 1: Illustration of RAG for multimodal long-context documents, comparing (a)–(e) baselines with (f) our MLDocRAG.
Figure 2: Overview of our proposed MLDocRAG framework, consisting of (a) MCQG Construction for building a multimodal chunk-query graph and (b) MCQG Usage for retrieving relevant multimodal chunks in long-document QA.
Figure 3: Ablation on node variants across both datasets.
Figure 4: Ablation on hyperparameters (MMLongBench-Doc).
Figure 5: Ablation on hyperparameters (LongDocURL).
...and 2 more figures
