VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha

TL;DR

The paper introduces VisDoMBench, a first-of-its-kind benchmark for multi-document QA over visually rich content, and pairs it with VisDoMRAG, a multimodal Retrieval Augmented Generation framework that runs textual and visual evidence pipelines in parallel and reconciles them through consistency-constrained modality fusion. Each pipeline performs evidence curation and chain-of-thought reasoning; the fusion step then aligns the two reasoning chains to produce a coherent final answer and improve verifiability. Across diverse datasets and LLMs, VisDoMRAG improves end-to-end performance by approximately 12–20% over unimodal and long-context LLM baselines, which supports the importance of joint multimodal retrieval and cross-modal reasoning in complex document QA. The work advances practical multimodal, multi-document QA over tables, charts, and slides in real-world collections, and provides a benchmark to drive future improvements, including end-to-end training in resource-constrained settings.

Abstract

Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. Extensive results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.
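
The control flow described in the abstract (two concurrent pipelines, each doing retrieval, evidence curation, and chain-of-thought reasoning, followed by a fusion step) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `PipelineOutput`, `run_pipeline`, `answer_query`, and all `toy_*` functions are hypothetical names, the retrievers and reasoners are stand-ins for real retrieval models and LLM calls, and `toy_fuse` is a placeholder rule rather than the paper's consistency-constrained fusion.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PipelineOutput:
    """Result of one modality-specific pipeline (hypothetical container)."""
    evidence: List[str]  # curated evidence drawn from the retrieved context
    reasoning: str       # chain-of-thought text for this modality
    answer: str          # modality-specific answer


# Hypothetical callable types standing in for retrieval models and LLM calls.
Retriever = Callable[[str, int], List[str]]
Reasoner = Callable[[str, List[str]], PipelineOutput]
Fuser = Callable[[str, PipelineOutput, PipelineOutput], str]


def run_pipeline(query: str, retrieve: Retriever, reason: Reasoner, k: int) -> PipelineOutput:
    """Retrieve top-k context, then curate evidence and reason over it."""
    context = retrieve(query, k)
    return reason(query, context)


def answer_query(query: str, text_retrieve: Retriever, text_reason: Reasoner,
                 visual_retrieve: Retriever, visual_reason: Reasoner,
                 fuse: Fuser, k: int = 5) -> str:
    """Run the textual and visual pipelines concurrently, then fuse their outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(run_pipeline, query, text_retrieve, text_reason, k)
        visual_future = pool.submit(run_pipeline, query, visual_retrieve, visual_reason, k)
        text_out, visual_out = text_future.result(), visual_future.result()
    return fuse(query, text_out, visual_out)


# --- Toy stand-ins so the sketch runs end to end (all assumptions) ---
def toy_text_retrieve(query: str, k: int) -> List[str]:
    return ["text chunk: revenue grew in 2023"][:k]


def toy_visual_retrieve(query: str, k: int) -> List[str]:
    return ["page image 4: bar chart of revenue"][:k]


def toy_reason(query: str, context: List[str]) -> PipelineOutput:
    return PipelineOutput(evidence=context,
                          reasoning=f"Based on {len(context)} retrieved item(s).",
                          answer="revenue grew")


def toy_fuse(query: str, text_out: PipelineOutput, visual_out: PipelineOutput) -> str:
    # Placeholder rule, NOT the paper's fusion: keep the shared answer when the
    # modalities agree, otherwise prefer the output with more curated evidence.
    if text_out.answer == visual_out.answer:
        return text_out.answer
    return max((text_out, visual_out), key=lambda out: len(out.evidence)).answer


final_answer = answer_query("Did revenue grow?", toy_text_retrieve, toy_reason,
                            toy_visual_retrieve, toy_reason, toy_fuse)
```

In a real system the fusion callable would presumably be another LLM call that receives both reasoning chains and both answers; passing it in as a parameter keeps that choice outside the orchestration logic.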

Paper Structure

This paper contains 42 sections, 19 figures, 9 tables.

Figures (19)

  • Figure 1: Multi-document QA systems require inferring relevant context from a large volume of unstructured data, inherently making it a more challenging task than single-document QA.
  • Figure 2: VisDoMRAG: Given a set of documents, VisDoMRAG performs evidence-driven ➊ Visual RAG and ➋ Textual RAG in parallel, prompting the LLMs to answer a query from the respective retrieved context via Evidence Curation and Chain-of-Thought reasoning. The reasoning chains and answers from the textual and visual pipelines are then combined via ➌ Modality Fusion, where the outputs of both modalities are aligned using consistency analysis of their reasoning chains to arrive at the final answer.
  • Figure 3: Comparison of retrieval performance across datasets, for benchmarked retrievers (BM25, MiniLM, MPNet, BGE1.5, ColPali, ColQwen), at different context window lengths, varying $k \in \{1, 5, 10, 20\}$.
  • Figure 4: Comparative performance between Long Context and VisDoMRAG (averaged across LLMs) evaluated on different ranges of the total number of pages $\bar{p} = \sum_{d \in \mathcal{D}} |d|$, with Low ($\bar{p} \le 100$), Medium ($100 < \bar{p} \le 150$), and High ($\bar{p} > 150$) volumes.
  • Figure 5: Qualitative example from the PaperTab dataset, comparing VisDoMRAG with Unimodal RAG strategies.
  • ...and 14 more figures
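
The page-volume grouping in the Figure 4 caption can be made concrete with a short sketch. It assumes each document is represented only by its page count; the function names `total_pages` and `volume_bucket` are hypothetical, and a collection with exactly 150 pages is placed in Medium here, which is an assumption about how the boundary is handled.

```python
from typing import Sequence


def total_pages(page_counts: Sequence[int]) -> int:
    """Total page count of a collection: the sum of |d| over its documents."""
    return sum(page_counts)


def volume_bucket(page_counts: Sequence[int]) -> str:
    """Map a collection to Low / Medium / High by its total page count."""
    p_bar = total_pages(page_counts)
    if p_bar <= 100:
        return "Low"
    if p_bar <= 150:
        # Assumption: exactly 150 pages counts as Medium.
        return "Medium"
    return "High"


# Example: three documents of 40, 35, and 20 pages total 95 pages.
example_bucket = volume_bucket([40, 35, 20])
```

Under these thresholds, the example collection of 95 pages falls in the Low group.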