IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering

Connor Shorten, Augustas Skaburskas, Daniel M. Jones, Charles Pierse, Roberto Esposito, John Trengrove, Etienne Dilocker, Bob van Luijt

TL;DR

This work introduces IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. It analyzes the complementary limitations of unimodal text and image representations and identifies question types that require one modality over the other.

Abstract

AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieves 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, enabling multimodal hybrid search to outperform either alone, achieving 49% Recall@1, 81% Recall@5, and 95% Recall@20. We further evaluate efficiency-performance tradeoffs with MUVERA and assess multiple multi-vector image embedding models. Among closed-source models, Cohere Embed v4 page image embeddings outperform Voyage 3 Large text embeddings and all tested open-source models, achieving 58% Recall@1, 87% Recall@5, and 97% Recall@20. For question answering, text-based RAG systems achieve higher ground-truth alignment than image-based systems (0.82 vs. 0.71), and both benefit substantially from increased retrieval depth, with multi-document retrieval outperforming oracle single-document retrieval. We analyze the complementary limitations of unimodal text and image representations and identify question types that require one modality over the other. The IRPAPERS dataset and all experimental code are publicly available.
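To make the Recall@k figures above concrete, here is a minimal sketch of how such a metric could be computed. It assumes each needle-in-the-haystack question has exactly one relevant page; the function name, the page-id format, and the toy data are illustrative assumptions, not taken from the paper's released code.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of questions whose gold page appears among the top-k results.

    ranked: question id -> page ids ordered from best to worst (hypothetical format)
    gold:   question id -> the single relevant page id (assumed one per question)
    """
    if not gold:
        raise ValueError("gold must contain at least one question")
    hits = sum(1 for qid, page in gold.items() if page in ranked.get(qid, [])[:k])
    return hits / len(gold)


# Toy data: three questions, each with a three-page ranking.
ranked = {
    "q1": ["p3", "p1", "p2"],  # gold page at rank 2
    "q2": ["p5", "p4", "p6"],  # gold page at rank 1
    "q3": ["p9", "p8", "p7"],  # gold page never retrieved
}
gold = {"q1": "p1", "q2": "p5", "q3": "p0"}

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(ranked, gold, k):.2f}")
```

With a single gold page per question, Recall@k can only stay flat or rise as k grows, which is why the reported numbers climb from @1 to @20.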

Paper Structure (27 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: An overview of our experimental methodology. We compare retrieval systems operating on embedded page images with ColModernVBERT and MUVERA encoding to embedded text transcriptions that leverage GPT-4.1 OCR and a hybrid search retrieval strategy combining Arctic 2.0 dense text embeddings with BM25. We find the highest performance with multimodal hybrid search, combining normalized scores from all three retrieval systems pictured.
  • Figure 2: Results of our retrieval test. Multimodal hybrid search consistently outperforms single-modality retrieval across all recall levels, highlighting the complementary strengths of text and image representations.
  • Figure 3: Comparison of Multimodal Hybrid Search fusion strategies combining hybrid text search (Arctic 2.0 + BM25) with ColModernVBERT image embeddings. For each value of $\alpha$, we report results for both RRF and RSF, visualized as distinct fusion strategies, where $\alpha$=0 uses text only and $\alpha$=1 uses images only. The highest performing configuration(s) for each respective recall target (@1, @5, and @20) are highlighted in pink.
  • Figure 4: Comparison of Multimodal Hybrid Search fusion strategies, using hybrid text search (Voyage 3 Large + BM25) for text and Cohere Embed v4.0 for images. We compare Reciprocal Rank Fusion (RRF) and Relative Score Fusion (RSF) across different values of $\alpha$, where $\alpha$=0 uses text only and $\alpha$=1 uses images only. The highest performing configurations for each respective recall target (@1, @5, and @20) are highlighted in pink.
  • Figure 5: A heatmap visualization of MaxSim scoring for the query: What three categories does RaFe identify for how query rewriting improvements occur based on case studies?
  • ...and 1 more figure
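
The captions for Figures 3 and 4 describe two fusion strategies, RRF and RSF, weighted by $\alpha$ so that $\alpha$=0 uses text only and $\alpha$=1 uses images only. The sketch below shows one common way such a weighting could be written. It is not the paper's implementation: the RRF constant `k = 60`, the min-max normalization used for RSF, the function names, and the toy scores are all assumptions made for illustration.

```python
from typing import Dict, List

Scores = Dict[str, float]  # page id -> raw retrieval score (higher is better)


def rrf(text: Scores, image: Scores, alpha: float, k: int = 60) -> Scores:
    """Weighted Reciprocal Rank Fusion: uses only each system's ranks."""
    fused: Scores = {}
    for weight, scores in ((1.0 - alpha, text), (alpha, image)):
        ranking = sorted(scores, key=lambda page: scores[page], reverse=True)
        for rank, page in enumerate(ranking, start=1):
            fused[page] = fused.get(page, 0.0) + weight / (k + rank)
    return fused


def min_max(scores: Scores) -> Scores:
    """Rescale one system's scores to [0, 1] (assumed RSF normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {page: 1.0 for page in scores}
    return {page: (s - lo) / (hi - lo) for page, s in scores.items()}


def rsf(text: Scores, image: Scores, alpha: float) -> Scores:
    """Weighted Relative Score Fusion: mixes normalized score magnitudes."""
    fused: Scores = {}
    for weight, scores in ((1.0 - alpha, min_max(text)), (alpha, min_max(image))):
        for page, s in scores.items():
            fused[page] = fused.get(page, 0.0) + weight * s
    return fused


def top(fused: Scores, n: int) -> List[str]:
    """Return the n best page ids by fused score."""
    return sorted(fused, key=lambda page: fused[page], reverse=True)[:n]


# Toy scores on different scales, as text and image retrievers typically produce.
text_scores = {"p1": 12.0, "p2": 9.0, "p3": 3.0}
image_scores = {"p2": 0.9, "p3": 0.8, "p4": 0.1}

print(top(rrf(text_scores, image_scores, alpha=0.5), 2))
print(top(rsf(text_scores, image_scores, alpha=0.5), 2))
```

The two strategies can disagree on the same inputs: RRF discards score magnitudes and rewards pages ranked well by both systems, while RSF keeps relative gaps, so a page that one system scores far above the rest can outrank a page that both systems rank moderately.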