MHier-RAG: Multi-Modal RAG for Visual-Rich Document Question-Answering via Hierarchical and Multi-Granularity Reasoning

Ziyu Gong, Chengcheng Mai, Yihua Huang

TL;DR

This work tackles multi-modal long-context Doc-QA by introducing MHier-RAG, a retrieval-augmented approach with a hierarchical index that jointly connects in-page modalities and cross-page content. It combines flattened in-page chunks and topological cross-page chunks with a multi-granularity retrieval scheme (page-level parent-page retrieval and document-level summary retrieval) and LLM-based re-ranking, enabling robust multi-modal integration and long-distance reasoning. Empirical results on MMLongBench-Doc and LongDocURL show substantial gains over LVLM- and prior RAG-based methods, and ablations confirm the importance of visual information, broad parent-page retrieval, and cross-page summaries. The method offers a scalable, generalizable framework for visually rich, multi-page documents, with strong implications for accurate, modality-aware document question answering in real-world applications.

Abstract

The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidence (such as text, tables, charts, images, and layouts) distributed across multiple pages for question understanding and answer generation. Existing methods can be categorized into Large Vision-Language Model (LVLM)-based and Retrieval-Augmented Generation (RAG)-based methods. However, the former are susceptible to hallucinations, while the latter struggle with inter-modal disconnection and cross-page fragmentation. To address these challenges, a novel multi-modal RAG model, named MHier-RAG, is proposed, which leverages both textual and visual information across long-range pages to facilitate accurate question answering for visual-rich documents. A hierarchical indexing method that integrates flattened in-page chunks and topological cross-page chunks is designed to jointly establish in-page multi-modal associations and long-distance cross-page dependencies. Using joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method, comprising page-level parent-page retrieval and document-level summary retrieval, is proposed to foster multi-modal evidence connection and long-distance evidence integration and reasoning. Experiments on two public datasets, MMLongBench-Doc and LongDocURL, demonstrate the superiority of MHier-RAG in understanding and answering questions over modality-rich, multi-page documents.
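
To make the page-level parent-page retrieval idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each in-page chunk already carries an embedding and a pointer to its parent page, scores chunks by plain cosine similarity, and promotes the best-matching chunks to their parent pages. The function names, the `(chunk_id, parent_page, embedding)` tuple layout, and the `top_k_*` parameters are all hypothetical; the paper's joint similarity evaluation, LLM-based re-ranking, and document-level summary retrieval are omitted.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_parent_pages(query_vec, chunks, top_k_chunks=3, top_k_pages=2):
    """Return parent pages of the chunks most similar to the query.

    chunks: list of (chunk_id, parent_page, embedding) tuples (hypothetical layout).
    """
    # Score every chunk against the query and keep the best few.
    scored = sorted(
        ((cosine(query_vec, emb), cid, page) for cid, page, emb in chunks),
        reverse=True,
    )[:top_k_chunks]

    # A page inherits the score of its best-matching retrieved chunk.
    page_score = {}
    for score, _cid, page in scored:
        page_score[page] = max(page_score.get(page, score), score)

    # Rank pages by that score and return the top ones.
    ranked = sorted(page_score.items(), key=lambda kv: kv[1], reverse=True)
    return [page for page, _ in ranked[:top_k_pages]]
```

Returning whole parent pages rather than isolated chunks is what lets a downstream model see a chunk together with the text, tables, and figures that share its page, under the assumptions stated above.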

Paper Structure

This paper contains 24 sections, 13 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Two challenges for multi-modal long-context document question-answering.
  • Figure 2: Overview of MHier-RAG with hierarchical index and multi-granularity retrieval for multi-modal Doc-QA. ($P_n$ is the parent page of $c_n$, $cls_K$ is the clustered block, 'CoT' denotes chain-of-thought and 'SO' denotes a structured output format.)
  • Figure 3: Performance of our MHier-RAG model on the MMLongBench-Doc dataset as the page number and summary number vary.
  • Figure 4: Case study on the MMLongBench-Doc and LongDocURL datasets comparing the answers of our MHier-RAG with those of LVLM-based methods (such as GPT-4o, DeepSeek-chat, Qwen-VL-Plus and ERNIE-Turbo).
  • Figure 5: Necessities of multi-modal connection and long-distance reasoning for multi-modal long-context Doc-QA methods.