FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

Samuel Hildebrand, Curtis Taylor, Sean Oesch, James M Ghawaly, Amir Sadovnik, Ryan Shivers, Brandon Schreiber, Kevin Kurian

TL;DR

FATHOMS-RAG provides a reproducible benchmark to evaluate end-to-end multimodal RAG pipelines on scientific PDFs. It introduces a 93-question dataset, a phrase-level recall scoring scheme, and a nearest-neighbor classifier to separate abstentions from hallucinations, enabling robust pipeline-level evaluation. The study compares open-source pipelines (text-only LlamaIndex; Docling with EasyOCR) and closed-source APIs (Claude Sonnet-4, Gemini 2.5 Flash, GPT-4.1, GPT-4o), revealing strong advantages for closed systems and notable gains from OCR-based ingestion, yet persistent challenges in cross-document multimodal reasoning. Overall, the work provides a lightweight, reproducible framework for benchmarking and guiding future improvements in trustworthy retrieval-augmented multimodal systems.

Abstract

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline's ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
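The abstract names two scoring components without defining them here: a phrase-level recall metric and a nearest-neighbor embedding classifier for flagging hallucinations. The sketch below is one plausible reading, not the authors' implementation. It assumes recall is the fraction of human-specified key phrases found in a normalized response, and that an incorrect response is labeled by its nearest labeled exemplar under cosine similarity. The function names, the exemplar sentences, and the bag-of-words `embed` stand-in (used in place of a real sentence-embedding model) are all hypothetical.

```python
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and keep only alphanumeric tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def phrase_recall(answer: str, required_phrases: list[str]) -> float:
    """Fraction of required phrases that occur in the answer (assumed metric)."""
    if not required_phrases:
        raise ValueError("at least one required phrase is needed")
    # Pad with spaces so phrases only match on whole-token boundaries.
    haystack = f" {normalize(answer)} "
    hits = sum(1 for p in required_phrases if f" {normalize(p)} " in haystack)
    return hits / len(required_phrases)


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(normalize(text).split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical labeled exemplars: refusals versus confident answers.
EXEMPLARS = [
    ("I do not know the answer based on the provided documents.", "abstention"),
    ("The provided context does not contain enough information to answer "
     "the question.", "abstention"),
    ("The value reported in the table is 12.5 milliseconds.", "answer"),
    ("The figure shows a peak of 40 units at step three.", "answer"),
]


def label_incorrect_response(response: str) -> str:
    """1-nearest-neighbor label for a response already judged incorrect.

    An incorrect response that looks like a confident answer is counted as a
    hallucination; one that looks like a refusal is counted as an abstention.
    """
    query = embed(response)
    _, nearest_label = max(
        EXEMPLARS, key=lambda item: cosine(query, embed(item[0]))
    )
    return "abstention" if nearest_label == "abstention" else "hallucination"
```

Under these assumptions, `phrase_recall("Accuracy was 63%.", ["63%", "LlamaIndex"])` scores half the phrases as found, and a wrong but confident response is labeled a hallucination rather than an abstention. Swapping `embed` for a real sentence-embedding model would leave the rest of the sketch unchanged.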

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: Answer Retrieval Accuracy for LlamaIndex Text Only RAG pipeline. (Each axis normalized) Text-only retrieval reaches 63% accuracy; however, other categories receive significantly lower scores.
  • Figure 2: Calculated Hallucination Rate for LlamaIndex Text Only RAG pipeline. (Each axis normalized) Hallucinations reach as high as 86% since this pipeline has no capability to process images.
  • Figure 3: Answer Retrieval Accuracy for Docling and EasyOCR RAG pipeline. (Each axis normalized)
  • Figure 4: Calculated Hallucination Rate for Docling and EasyOCR RAG pipeline (Each axis normalized)
  • Figure 5: Answer Retrieval Accuracy for RAG pipelines of Closed-Source APIs. (Each axis normalized) Significantly higher scores in all categories.
  • ...and 1 more figure