Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, Mohamed Elhoseiny

TL;DR

The paper addresses the difficulty of vision-language reasoning over large-scale visual document collections by introducing DocHaystack and InfoHaystack, benchmarks that scale up to 1,000 documents per query and enforce unique, document-specific answers through a three-step data-curation workflow. It then proposes V-RAG, a vision-centric retrieval-augmented generation framework that ensembles multiple vision encoders and a dedicated LMM-based relevance module to retrieve relevant documents and generate answers. On DocHaystack-1000 and InfoHaystack-1000, V-RAG improves Recall@1 by 9% and 11%, respectively, over the previous best baselines, and it improves VQA performance when integrated with large multimodal models such as GPT-4o. The work provides datasets and code to advance research in scalable visual document retrieval and reasoning, with potential applications in large-scale visual search and document analysis.

Abstract

Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they face limitations in real-world applications requiring complex reasoning over a large number of images. Existing benchmarks for multi-image question answering are limited in scope: each question is paired with at most 30 images, which does not fully capture the demands of large-scale retrieval tasks encountered in real-world use. To reduce these gaps, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. Additionally, we propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, and a dedicated question-document relevance module. V-RAG sets a new standard, with 9% and 11% improvements in Recall@1 on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, over the previous best baseline models. Additionally, integrating V-RAG with LMMs enables them to efficiently operate across thousands of images, yielding significant improvements on our DocHaystack and InfoHaystack benchmarks. Our code and datasets are available at https://github.com/Vision-CAIR/dochaystacks
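
The Recall@1 figures quoted above measure how often the retriever ranks the correct document first. The following is a minimal sketch of that metric, assuming each question has exactly one gold document (consistent with the benchmarks' document-specific questions); the function name `recall_at_k` and the input layout are illustrative and not taken from the released code.

```python
from typing import List, Sequence


def recall_at_k(ranked_lists: Sequence[Sequence[int]],
                gold_ids: Sequence[int],
                k: int = 1) -> float:
    """Fraction of questions whose gold document appears in the top-k results.

    ranked_lists[i] holds document ids for question i, best match first.
    gold_ids[i] is the single document that answers question i (assumption).
    """
    if len(ranked_lists) != len(gold_ids):
        raise ValueError("one ranked list is required per question")
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_ids)
               if gold in ranked[:k])
    return hits / len(gold_ids)


# Toy example with three questions (ids are made up for illustration).
rankings: List[List[int]] = [[3, 1, 2], [0, 4], [5, 2]]
gold = [3, 4, 2]
r1 = recall_at_k(rankings, gold, k=1)
r2 = recall_at_k(rankings, gold, k=2)
```

In this toy example only the first question has its gold document at rank 1, while all three have it within the top 2.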

Paper Structure

This paper contains 10 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison between previous and proposed benchmarks. Given a question as input, all benchmarks aim to retrieve relevant images from an image pool to correctly answer the question. Unlike prior benchmarks such as RetVQA (Penamakuri et al., 2023) and WebQA (Chang et al., 2022), which pair each question with a limited set of images (typically $\leq$ 30), our benchmarks, DocHaystack and InfoHaystack, map each question to a substantially larger document collection, scaling up to 1,000 visual documents. This expanded scope more accurately represents large-scale document retrieval scenarios and poses a greater challenge for both retrieval accuracy and visual question answering.
  • Figure 2: Data Curation Pipeline. Our benchmarks are curated based on the DocVQA and InfographicVQA datasets, following a three-step filtering process to obtain document-specific question-answer pairs. In Step 1, we filter out general questions (e.g., "What is the table number?"), as these could be answered by multiple documents and lack specificity. Step 2 involves a manual review by human annotators to further remove general questions. In Step 3, we eliminate generic-knowledge questions (e.g., "How many sports were in the 2008 Beijing Paralympic Games?") that can be answered directly by large language models without requiring image input.
  • Figure 3: The V-RAG pipeline workflow. In the top section, a vision encoder ensemble combines multiple vision models (CLIP, SigLIP, and OpenCLIP) to process a large document haystack. Each encoder computes similarity scores, which are averaged into $Sim_{\text{avg}}$. The top $m$ documents, based on these scores, are selected for further analysis. In the bottom right, the LMM-Filter Module uses a pretrained LMM to assess whether each selected document can potentially answer the posed question. This filtering step removes documents that do not match, retaining only relevant ones. Finally, the top $k$ most relevant images are input into the LMM along with the original question $q$ to generate a specific answer.
  • Figure 4: Question type analysis. We analyze the distribution of question types in DocHaystack and InfoHaystack. Each benchmark categorizes its data into 5 types.
  • Figure 5: Top-k selection ablation analysis for LMM-VQA. We report results for LLaVA, Qwen2-VL, GPT-4o, and a fine-tuned Qwen2-VL model on the DocHaystack-100/1000 and InfoHaystack-100/1000 benchmarks. All models are integrated with our V-RAG framework. We show the VQA accuracy for each ablation.
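
The retrieval flow described in the Figure 3 caption (per-encoder similarity, averaging, top-m selection, LMM filtering, top-k selection) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: cosine similarity is assumed as the per-encoder score, embeddings are assumed to be precomputed, and the `is_relevant` callback is a hypothetical stand-in for the LMM-Filter Module's yes/no judgment. All function and parameter names are invented for this example.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_similarity(query_embs: Sequence[Vector],
                       doc_embs: Sequence[Sequence[Vector]]) -> List[float]:
    """Average the per-encoder similarities into one score per document.

    query_embs[e] is the question embedding from encoder e.
    doc_embs[e][i] is the embedding of document i from encoder e.
    """
    n_enc = len(query_embs)
    n_docs = len(doc_embs[0])
    return [
        sum(cosine(query_embs[e], doc_embs[e][i]) for e in range(n_enc)) / n_enc
        for i in range(n_docs)
    ]


def v_rag_retrieve(query_embs: Sequence[Vector],
                   doc_embs: Sequence[Sequence[Vector]],
                   is_relevant: Callable[[int], bool],
                   m: int,
                   k: int) -> List[int]:
    """Return up to k document indices after ensemble ranking and filtering."""
    sim_avg = average_similarity(query_embs, doc_embs)
    # Rank all documents by averaged similarity, best first.
    ranked = sorted(range(len(sim_avg)), key=lambda i: sim_avg[i], reverse=True)
    top_m = ranked[:m]
    # Hypothetical stand-in for the LMM filter: keep documents judged relevant.
    kept = [i for i in top_m if is_relevant(i)]
    return kept[:k]


# Toy data: two "encoders" with different embedding sizes, four documents.
queries = [[1.0, 0.0], [1.0, 0.0, 0.0]]
docs = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]
selected = v_rag_retrieve(queries, docs, is_relevant=lambda i: i != 2, m=3, k=2)
```

Averaging is done per document across encoders, so each encoder may use its own embedding dimension. In a real system the filter callback would prompt an LMM with the question and the document image, and the surviving images would be passed to the answering LMM together with the question.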