REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark

Navve Wasserman, Roi Pony, Oshri Naparstek, Adi Raz Goldfarb, Eli Schwartz, Udi Barzelay, Leonid Karlinsky

TL;DR

REAL-MM-RAG-Bench presents a real-world, multi-modal retrieval benchmark for RAG systems, addressing critical gaps in realism, difficulty, and labeling accuracy. It deploys an automated pipeline to collect long, multi-modal documents, generate and filter naturalistic RAG-oriented queries, and verify correct labeling, while incorporating multi-level query rephrasing to test semantic understanding. The authors introduce two targeted training datasets—a rephrased dataset and a finance-table-heavy dataset—demonstrating that fine-tuning on these data yields state-of-the-art retrieval performance on the benchmark. The work provides a practical framework for evaluating and improving multi-modal RAG retrieval, along with training resources that mitigate current weaknesses such as table comprehension and rephrasing robustness.

Abstract

Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks, as currently designed, do not fully capture real-world challenges. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) Realistic-RAG queries, and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and in robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems while also providing training data and models that address current limitations.

Paper Structure

This paper contains 34 sections, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Proposed REAL-MM-RAG Benchmark
  • Figure 2: Benchmark Construction Pipeline
  • Figure 3: Table-Focused Training Improves Financial Benchmarks. Fine-tuning with our proposed table-heavy training set, combined with the ColPali training set (both in their original and rephrased versions), significantly enhances performance on financial benchmarks (results shown for rephrasing level 3).
  • Figure 3: Query Rephrasing Effect. This table presents the NDCG@5 scores averaged across all benchmarks for different models and rephrasing levels. '0' represents no rephrasing, while '3' indicates significant rephrasing (see the rephrasing benchmarks table for full results).
  • Figure 4: Fine-Tuning on Rephrased Training Set. We compare the NDCG@5 scores across rephrasing levels for baseline models (ColPali and ColQwen) against our fine-tuned models (RobCol). The results demonstrate that fine-tuning with our rephrased training data significantly enhances rephrasing robustness for both ColPali and ColQwen.
  • ...and 8 more figures
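
The captions above report NDCG@5 without spelling the metric out. As a minimal sketch, and not the authors' evaluation code, the following computes NDCG@k for a single query. It assumes binary relevance (a retrieved page either is or is not a labeled answer page) and linear gain; the function names and page identifiers are hypothetical and chosen only for illustration.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains of a ranked list."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """NDCG@k for one query, assuming binary relevance labels (an assumption)."""
    gains = [1.0 if page_id in relevant_ids else 0.0 for page_id in ranked_ids]
    # The ideal ranking places every relevant page first, up to k positions.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical page ids: the single labeled page is retrieved at rank 2.
score = ndcg_at_k(["p3", "p7", "p1"], {"p7"}, k=5)
```

A benchmark-level score would then be the mean of this value over all queries, and comparing that mean across rephrasing levels 0 through 3 gives the kind of robustness curve the captions describe.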