VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Jun Suzuki

TL;DR

VDocRAG tackles the challenge of answering questions over visually-rich documents by operating directly on document images rather than parsed text. The framework combines a vision-language retrieval module (VDocRetriever) with a generation module (VDocGenerator) and is underpinned by two self-supervised pre-training tasks that compress visual information into dense, text-aligned representations. The authors introduce OpenDocVQA, a large open-domain dataset that supports retrieval-augmented QA across diverse document types, including multi-hop reasoning. Empirical results show that VDocRAG outperforms text-based RAG and demonstrates strong generalization, supported by ablations highlighting the benefits of the pre-training tasks, LVLM backbones, and dataset contributions. The work advances practical retrieval-augmented reasoning for real-world documents and provides a foundation for further efficiency and multimodal integration improvements.

Abstract

We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.

Paper Structure

This paper contains 55 sections, 4 equations, 11 figures, 15 tables.

Figures (11)

  • Figure 1: Our framework of VDocRAG and examples from OpenDocVQA. VDocRAG consists of VDocRetriever and VDocGenerator, which can retrieve relevant documents and generate answers by understanding the original appearance of documents.
  • Figure 2: Process of creating multi-hop DocumentVQA questions.
  • Figure 3: Overview of our VDocRAG model. VDocRetriever retrieves document images related to the question from a corpus of document images, and VDocGenerator uses these retrieved images to generate the answer.
  • Figure 4: Our pre-training tasks using unlabeled documents and fine-tuning in VDocRetriever. The RCR task retrieves relevant images given corresponding OCR tokens, and the RCG task outputs OCR tokens by paying attention to only the <EOS> token.
  • Figure 5: Performance under different document lengths on InfoVQA (single-pool setting).
  • ...and 6 more figures
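
The retrieval step described in the Figure 3 caption (find the document images most relevant to a question, then hand them to the generator) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors, the use of cosine similarity, and the names `cosine` and `retrieve_top_k` are assumptions made for illustration. In VDocRAG the embeddings would come from the vision-language retriever rather than being hand-written.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, page_vecs, k=2):
    """Return the indices of the k page embeddings most similar to the query.

    Hypothetical helper: query_vec stands in for a question embedding and
    page_vecs for one dense embedding per document image.
    """
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(page_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]

# Toy 2-D embeddings (assumed values, for illustration only).
page_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [0.9, 0.1]

top_pages = retrieve_top_k(query_vec, page_vecs, k=2)
print(top_pages)
```

In a full pipeline the selected page images, together with the question, would then be passed to the generation module; that step is omitted here because it depends on the specific model.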