Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Jun Chen; Dannong Xu; Junjie Fei; Chun-Mei Feng; Mohamed Elhoseiny

TL;DR

The paper addresses the difficulty of vision-language reasoning over large collections of visual documents. It introduces DocHaystack and InfoHaystack, two benchmarks that scale up to 1,000 documents per query and enforce unique, document-specific answers through a rigorous data-curation workflow. It then proposes V-RAG, a vision-centric retrieval-augmented generation framework that ensembles multiple vision encoders with a dedicated LMM-based relevance module to retrieve relevant documents efficiently and generate answers. On DocHaystack-1000 and InfoHaystack-1000, V-RAG improves Recall@1 by 9% and 11%, respectively, over the previous best baselines, and it improves VQA performance when integrated with large multimodal models such as GPT-4o. The authors release datasets and code to support research on scalable visual document retrieval and reasoning, with potential applications in large-scale visual search and document analysis.

Abstract

Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they face limitations in real-world applications that require complex reasoning over a large number of images. Existing benchmarks for multi-image question answering are limited in scope: each question is paired with at most 30 images, which does not fully capture the demands of large-scale retrieval tasks encountered in real-world use. To narrow these gaps, we introduce two document haystack benchmarks, DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. We also propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, together with a dedicated question-document relevance module. V-RAG sets a new standard, improving Recall@1 by 9% and 11% on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, compared to the previous best baseline models. Integrating V-RAG with LMMs further enables them to operate efficiently across thousands of images, yielding significant improvements on our DocHaystack and InfoHaystack benchmarks. Our code and datasets are available at https://github.com/Vision-CAIR/dochaystacks
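To make the retrieval setting and the Recall@1 metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes each encoder has already produced one similarity score per document for a question, and it combines encoders by simple averaging; the abstract does not say how V-RAG fuses encoder outputs, so that choice, the function names, and the toy scores are all illustrative assumptions. The LMM-based relevance module is omitted.

```python
from typing import List, Sequence


def ensemble_scores(per_encoder_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-document similarity scores across encoders.

    Averaging is an assumption made for illustration only; the source does
    not specify the fusion rule used by V-RAG.
    """
    n_encoders = len(per_encoder_scores)
    n_docs = len(per_encoder_scores[0])
    return [
        sum(scores[d] for scores in per_encoder_scores) / n_encoders
        for d in range(n_docs)
    ]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)[:k]


def recall_at_k(ranked: Sequence[Sequence[int]], gold: Sequence[int], k: int = 1) -> float:
    """Fraction of questions whose single gold document is in the top k."""
    hits = sum(1 for r, g in zip(ranked, gold) if g in r[:k])
    return hits / len(gold)


# Toy data (hypothetical): 2 questions, 3 documents, 2 encoders.
question_scores = [
    [[0.9, 0.2, 0.1], [0.6, 0.7, 0.1]],  # question 0: encoder A, encoder B
    [[0.3, 0.5, 0.4], [0.2, 0.3, 0.9]],  # question 1: encoder A, encoder B
]
gold_docs = [0, 1]  # index of the one document that answers each question

rankings = [top_k(ensemble_scores(s), k=3) for s in question_scores]
print("Recall@1:", recall_at_k(rankings, gold_docs, k=1))
print("Recall@2:", recall_at_k(rankings, gold_docs, k=2))
```

Because each benchmark question has a unique answering document, Recall@k here reduces to checking whether that one document appears among the top k retrieved results.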
