HKRAG: Holistic Knowledge Retrieval-Augmented Generation Over Visually-Rich Documents

Anyang Tong, Xiang Niu, ZhiPing Liu, Chang Tian, Yanyan Wei, Zenglin Shi, Meng Wang

TL;DR

This paper tackles a weakness of existing multimodal RAG methods for visually rich documents (VRDs): they overemphasize salient content and miss fine-print details. It introduces HKRAG, a two-component framework: a Hybrid Masking-based Holistic Retriever that explicitly models both salient and fine-print knowledge, and an Uncertainty-Guided Agentic Generator that uses answer uncertainty to dynamically select and integrate information. In experiments on open-domain document VQA benchmarks, HKRAG achieves state-of-the-art retrieval and generation performance in both zero-shot and supervised settings, supporting the importance of holistic knowledge integration for VRD understanding. The approach reduces hallucinations and improves factual accuracy, offering a practical path toward robust question answering over real-world visually rich documents.

Abstract

Existing multimodal Retrieval-Augmented Generation (RAG) methods for visually rich documents (VRD) are often biased towards retrieving salient knowledge (e.g., prominent text and visual elements), while largely neglecting the critical fine-print knowledge (e.g., small text, contextual details). This limitation leads to incomplete retrieval and compromises the generator's ability to produce accurate and comprehensive answers. To bridge this gap, we propose HKRAG, a new holistic RAG framework designed to explicitly capture and integrate both knowledge types. Our framework features two key components: (1) a Hybrid Masking-based Holistic Retriever that employs explicit masking strategies to separately model salient and fine-print knowledge, ensuring query-relevant, holistic information retrieval; and (2) an Uncertainty-Guided Agentic Generator that dynamically assesses the uncertainty of initial answers and actively decides how to integrate the two distinct knowledge streams for optimal response generation. Extensive experiments on open-domain visual question answering benchmarks show that HKRAG consistently outperforms existing methods in both zero-shot and supervised settings, demonstrating the critical importance of holistic knowledge retrieval for VRD understanding.
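
The abstract does not specify how answer uncertainty is measured or how the two knowledge streams are combined. As an illustration only, the following minimal sketch assumes uncertainty is approximated by the mean token entropy of the initial answer, and that a simple threshold decides whether fine-print evidence is added to the salient evidence. The function names, the entropy proxy, and the threshold value are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions.

    Assumption: this is only a stand-in for "answer uncertainty"; the paper's
    actual measure is not given in the abstract.
    """
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for dist in token_distributions:
        # Skip zero-probability entries, whose contribution to entropy is 0.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def select_context(
    salient: List[str],
    fine_print: List[str],
    uncertainty: float,
    threshold: float = 0.5,  # hypothetical cut-off, chosen for illustration
) -> List[str]:
    """Hypothetical routing rule between the two knowledge streams.

    Low uncertainty: keep only the salient evidence.
    High uncertainty: append the fine-print evidence as well.
    """
    if uncertainty <= threshold:
        return list(salient)
    return list(salient) + list(fine_print)


if __name__ == "__main__":
    confident_answer = [[0.99, 0.01], [0.98, 0.02]]
    hesitant_answer = [[0.5, 0.5], [0.4, 0.6]]
    for name, dists in [("confident", confident_answer), ("hesitant", hesitant_answer)]:
        u = mean_token_entropy(dists)
        context = select_context(["title text"], ["footnote text"], u)
        print(name, round(u, 3), context)
```

A threshold rule is the simplest possible choice here; an agentic generator as described in the abstract would presumably make a richer decision, so this sketch only conveys the general idea of gating evidence on uncertainty.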

Paper Structure

This paper contains 11 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: We demonstrate that both salient knowledge and fine-print knowledge are critical for retrieval and generation. Only by jointly leveraging both can we produce reliable answers and effectively mitigate hallucinations.
  • Figure 2: The proposed HKRAG includes (a) Hybrid Masking-based Holistic Retriever and (b) Uncertainty-Guided Agentic Generator.
  • Figure 3: Performance of (a) VDocRAG and (b) HKRAG on DUDE. We present the distribution of queries across two categories: those that retrieved the correct document in the top-3 position (“correct retrieval”), and those that provided the correct answer given the top-3 retrieved documents (“correct generation”).
  • Figure 4: Comparison of VidoRAG [wang2025vidorag] and our HKRAG across different sizes within the same series. The shaded area represents the gap between VidoRAG and HKRAG.
  • Figure 5: Qualitative results of our HKRAG compared to state-of-the-art VDocRAG [tanaka2025vdocrag] on open-domain visually rich documents.

Theorems & Definitions (2)

  • Definition 1: Low-uncertainty Query-document Pair (LQP)
  • Definition 2: High-uncertainty Query-document Pair (HQP)