ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao

TL;DR

ViDoRAG tackles retrieval, comprehension, and reasoning over visually rich documents by introducing ViDoSeek, a large-scale dataset, and a multi-agent RAG framework with Gaussian Mixture Model-based adaptive retrieval and iterative reasoning. The Seeker, Inspector, and Answer agents perform coarse-to-fine exploration, reflection, and synthesis, enhancing robustness to noise and enabling test-time scaling. Empirical results show ViDoRAG achieves over 10% improvement on ViDoSeek compared to strong baselines and demonstrates favorable latency-accuracy trade-offs; ablations and analysis highlight the contributions of hybrid retrieval and multi-agent generation. The work advances practical RAG for documents with charts, tables, and layouts and includes open-source code.

Abstract

Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code is available at https://github.com/Alibaba-NLP/ViDoRAG.
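The abstract names a GMM-based hybrid retrieval strategy but does not spell out its mechanics here. The following is a minimal sketch of one plausible reading, not the authors' implementation: fit a two-component, one-dimensional Gaussian mixture to a query's similarity scores, keep the candidates assigned to the higher-mean component (so the cut-off adapts per query instead of using a fixed top-K), and take the union of the textual and visual candidate sets. All function names, the hand-written EM loop, the variance floor, and the example scores are illustrative assumptions.

```python
import math


def fit_two_gaussians(scores, iters=50):
    """Fit a 1-D, two-component Gaussian mixture with plain EM (illustrative)."""
    lo, hi = min(scores), max(scores)
    means = [lo, hi]                          # start one component at each extreme
    spread = max((hi - lo) / 4.0, 1e-3)
    variances = [spread ** 2, spread ** 2]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            dens = [
                weights[k]
                * math.exp(-((s - means[k]) ** 2) / (2 * variances[k]))
                / math.sqrt(2 * math.pi * variances[k])
                for k in range(2)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, 1e-6)     # floor avoids a collapsed component
    return means, resp


def adaptive_top_k(scores, min_k=1):
    """Return indices of candidates in the high-similarity component, best first."""
    if len(scores) < 2:
        return list(range(len(scores)))
    means, resp = fit_two_gaussians(scores)
    high = 0 if means[0] > means[1] else 1
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = [i for i in ranked if resp[i][high] > 0.5]
    # Fall back to the best-scoring candidates if the mixture keeps too few.
    return keep if len(keep) >= min_k else ranked[:min_k]


def hybrid_candidates(text_scores, visual_scores):
    """Union of the adaptively selected textual and visual candidates (assumed merge rule)."""
    return sorted(set(adaptive_top_k(text_scores)) | set(adaptive_top_k(visual_scores)))


# Hypothetical similarity scores for ten pages under two retrievers.
text_scores = [0.91, 0.88, 0.86, 0.31, 0.29, 0.27, 0.25, 0.22, 0.20, 0.18]
visual_scores = [0.20, 0.21, 0.85, 0.90, 0.19, 0.22, 0.18, 0.20, 0.23, 0.21]
selected_pages = hybrid_candidates(text_scores, visual_scores)
```

In this sketch the number of retained pages varies with the shape of each score distribution, which is the property an adaptive cut-off is meant to provide; a production system would more likely rely on a tested mixture-model library than on a hand-rolled EM loop.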

Paper Structure

This paper contains 54 sections, 8 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Comparison of our work with existing datasets and methods. (a) In traditional datasets, each query must be paired with specific images or documents. In our ViDoSeek, each query has a unique answer within the large corpus. (b) Our ViDoRAG is a multi-agent, coarse-to-fine framework specifically optimized for visually rich documents.
  • Figure 2: Data construction pipeline. (a) We sample and filter documents according to the requirements to obtain candidates. (b) Experts then construct the initial queries from different content. (c) Next, we prompt GPT-4 to directly determine whether each query is a general query; the remaining queries are carefully reviewed against the top-K recalled images. (d) Finally, unqualified queries are refined by GPT-4o, paired with the golden image.
  • Figure 3: ViDoRAG Framework.
  • Figure 4: Retrieval performance across different retrievers and hybrid retrieval, along with ablations on GMM.
  • Figure 5: Latency Analysis on Generation.
  • ...and 9 more figures