HKRAG: Holistic Knowledge Retrieval-Augmented Generation Over Visually-Rich Documents
Anyang Tong, Xiang Niu, ZhiPing Liu, Chang Tian, Yanyan Wei, Zenglin Shi, Meng Wang
TL;DR
This paper addresses a weakness of existing multimodal RAG methods for visually rich documents (VRDs): they overemphasize salient content and miss fine-print details. It introduces HKRAG, a framework with two components: a Hybrid Masking-based Holistic Retriever that explicitly models both salient and fine-print knowledge, and an Uncertainty-Guided Agentic Generator that uses answer uncertainty to decide how to select and integrate the retrieved information. In experiments on open-domain DocumentVQA benchmarks, HKRAG achieves state-of-the-art retrieval and generation performance in both zero-shot and supervised settings, supporting the importance of holistic knowledge integration for VRD understanding. The approach reduces hallucinations and improves factual accuracy, offering a practical path toward robust question answering over real-world visually rich documents.
Abstract
Existing multimodal Retrieval-Augmented Generation (RAG) methods for visually rich documents (VRDs) are often biased towards retrieving salient knowledge (e.g., prominent text and visual elements), while largely neglecting critical fine-print knowledge (e.g., small text, contextual details). This limitation leads to incomplete retrieval and compromises the generator's ability to produce accurate and comprehensive answers. To bridge this gap, we propose HKRAG, a new holistic RAG framework designed to explicitly capture and integrate both knowledge types. Our framework features two key components: (1) a Hybrid Masking-based Holistic Retriever that employs explicit masking strategies to separately model salient and fine-print knowledge, ensuring query-relevant, holistic retrieval; and (2) an Uncertainty-guided Agentic Generator that dynamically assesses the uncertainty of initial answers and actively decides how to integrate the two distinct knowledge streams for optimal response generation. Extensive experiments on open-domain visual question answering benchmarks show that HKRAG consistently outperforms existing methods in both zero-shot and supervised settings, demonstrating the critical importance of holistic knowledge retrieval for VRD understanding.
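The uncertainty-guided integration step can be illustrated with a minimal sketch. The abstract does not specify how uncertainty is measured or how the decision is made, so everything below is an assumption made for illustration: mean token entropy stands in for the uncertainty estimate, a fixed threshold stands in for the agentic decision, and the names `mean_token_entropy`, `choose_context`, and `threshold` are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence


def mean_token_entropy(token_probs: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions.

    Assumption: entropy is used here as a stand-in uncertainty score;
    the paper may use a different measure.
    """
    if not token_probs:
        return 0.0
    total = 0.0
    for dist in token_probs:
        # Skip zero-probability entries, which contribute nothing to entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_probs)


def choose_context(
    salient: Sequence[str],
    fine_print: Sequence[str],
    token_probs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> List[str]:
    """Pick the evidence to pass to the generator for the final answer.

    If the initial answer looks confident, keep only the salient evidence;
    otherwise add the fine-print evidence as well.
    """
    if mean_token_entropy(token_probs) <= threshold:
        return list(salient)
    return list(salient) + list(fine_print)


# Usage: a near-uniform distribution over the initial answer's tokens
# signals high uncertainty, so both knowledge streams are combined.
context = choose_context(
    salient=["page title"],
    fine_print=["footnote text"],
    token_probs=[[0.5, 0.5]],
)
```

A single threshold is the simplest possible policy; an agentic generator as described in the abstract would presumably make a richer decision about how the two streams are merged.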
