MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao

TL;DR

This paper introduces MDocAgent, a multi-modal, multi-agent framework for Document Question Answering (DocQA) that integrates text and image modalities through two parallel RAG pipelines and five specialized agents. By combining a general agent, a critical agent that extracts critical information, text and image agents, and a summarizing agent, the approach enables refined cross-modal reasoning and robust final answers for long and visually rich documents. Experiments on five benchmarks show consistent performance gains over state-of-the-art LVLM and RAG methods, with ablations confirming the value of each agent and of the cross-modal synthesis mechanism. The work demonstrates the practicality of collaborative multi-agent architectures for complex DocQA and points toward their broader adoption in document understanding.

Abstract

Document Question Answering (DocQA) is a very common task. Existing methods that use Large Language Models (LLMs) or Large Vision Language Models (LVLMs) with Retrieval Augmented Generation (RAG) often prioritize information from a single modality, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi-modal reasoning, limiting their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and images. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent, and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, including MMLongBench and LongDocURL, demonstrate the effectiveness of MDocAgent, which achieves an average improvement of 12.1% over the current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at https://github.com/aiming-lab/MDocAgent.

Paper Structure

This paper contains 27 sections, 4 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Comparison of different approaches for DocQA. LVLMs often struggle with long documents and lack granular attention to detail, while also exhibiting limitations in cross-modal understanding. Single-modal context retrieval can handle long documents but still suffers from issues with detailed analysis or integrating information across modalities. Our MDocAgent addresses these challenges by combining text and image-based RAG with specialized agents for refined processing within each modality and a critical information extraction mechanism, showcasing improved DocQA performance.
  • Figure 2: Overview of MDocAgent: A multi-modal multi-agent framework operating in five stages: (1) Documents are processed using PDF tools to extract text and images. (2) Text-based and image-based RAG retrieves the top-k relevant segments and image pages. (3) The general agent provides a preliminary answer, and the critical agent extracts critical information from both modalities. (4) Specialized agents process the retrieved information and critical information within their respective modalities and generate refined answers. (5) The summarizing agent integrates all previous outputs to generate the final answer.
  • Figure 3: A case study comparing MDocAgent with two RAG baselines (ColBERT + Llama 3.1-8B and M3DocRAG). Given a question comparing two population sizes, both baselines fail to arrive at the correct answer. Our framework, through the collaborative efforts of its specialized agents, successfully identifies the relevant information from both the text and a table within the image, ultimately synthesizing the correct answer. This highlights the importance of granular, multi-modal analysis and the ability to accurately process information within the context.
  • Figure 4: A case study comparing MDocAgent with two baselines. Although ColPali is the only retriever that returns the correct evidence page, neither baseline identifies the correct answer. Our method, through critical information sharing and specialized agent collaboration, correctly pinpoints "Most Beautiful Campus" as the only reason without a corresponding image containing people.
  • Figure 5: A case study comparing MDocAgent with two RAG baselines. In this case, ColPali fails to retrieve the correct evidence page, which hinders M3DocRAG. Although ColBERT succeeds in retrieval, the ColBERT + Llama baseline still gives an incorrect answer. Only our multi-agent framework, through precise critical information extraction and agent collaboration, correctly identifies the M.A. degree.
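
The five-stage flow described in the Figure 2 caption can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `Agent`, `PipelineResult`, `top_k`, and `run_pipeline` are hypothetical, and so are the prompt strings. Agents are plain callables standing in for LLM or LVLM calls. Page images are represented by text descriptions, and retrieval is a toy word-overlap ranking rather than a ColBERT- or ColPali-style retriever. Passing the general agent's preliminary answer to the critical agent is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: an agent is any callable from prompt text to answer text.
Agent = Callable[[str], str]


@dataclass
class PipelineResult:
    general: str
    critical: str
    text: str
    image: str
    final: str


def top_k(query: str, items: List[str], k: int) -> List[str]:
    """Toy retriever: rank items by word overlap with the query (ties keep input order)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        items,
        key=lambda item: len(query_words & set(item.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def run_pipeline(
    question: str,
    text_segments: List[str],
    page_descriptions: List[str],
    agents: Dict[str, Agent],
    k: int = 2,
) -> PipelineResult:
    # Stage 1 (PDF extraction) is assumed done: the inputs are already-extracted
    # text segments plus per-page descriptions standing in for page images.

    # Stage 2: retrieve the top-k items separately for each modality.
    texts = top_k(question, text_segments, k)
    pages = top_k(question, page_descriptions, k)
    context = f"Question: {question}\nText: {texts}\nPages: {pages}"

    # Stage 3: preliminary answer, then critical information from both modalities.
    general = agents["general"](context)
    critical = agents["critical"](f"{context}\nPreliminary answer: {general}")

    # Stage 4: each specialized agent sees only its own modality plus the critical info.
    text_answer = agents["text"](
        f"Question: {question}\nText: {texts}\nCritical: {critical}"
    )
    image_answer = agents["image"](
        f"Question: {question}\nPages: {pages}\nCritical: {critical}"
    )

    # Stage 5: the summarizing agent integrates all previous outputs.
    final = agents["summarizing"](
        f"Question: {question}\nGeneral: {general}\nCritical: {critical}\n"
        f"Text agent: {text_answer}\nImage agent: {image_answer}"
    )
    return PipelineResult(general, critical, text_answer, image_answer, final)


if __name__ == "__main__":
    # Stub agents that only label their output; replace with real model calls.
    def make_stub(name: str) -> Agent:
        return lambda prompt: f"<{name} answer>"

    roles = ["general", "critical", "text", "image", "summarizing"]
    demo = run_pipeline(
        "Which city has the larger population?",
        ["City A has a population of 10,000.", "The weather was mild."],
        ["Page 3: table of city population figures", "Page 1: cover logo"],
        {role: make_stub(role) for role in roles},
        k=1,
    )
    print(demo.final)
```

Keeping the agents as interchangeable callables makes the stage ordering easy to see, and it lets a single agent be removed or swapped in the spirit of the ablations mentioned in the TL;DR.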