MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation

Chanhee Park, Hyeonseok Moon, Chanjun Park, Heuiseok Lim

TL;DR

MIRAGE addresses the challenge of evaluating retrieval-augmented generation (RAG) systems by providing a compact QA benchmark with a dedicated retrieval pool and four RAG adaptability metrics. The dataset (7,560 QA items linked to 37,800 chunks) and a multi-stage filtering pipeline enable precise, reproducible evaluation of both retrievers and LLMs under three context settings (Base, Oracle, Mixed). Through experiments with diverse retrievers and LLMs, the paper shows that noise in retrieved context can degrade answers, that the size of this effect depends on the model chosen, and that retrieval can bring smaller models closer to the performance of larger ones. The data and code are publicly available, so researchers can analyze and tune retriever-LLM pairings in practical settings.

Abstract

Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings. The MIRAGE code and data are available at https://github.com/nlpai-lab/MIRAGE.
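
The abstract names four adaptability metrics but does not define them here. As a rough illustration only, the sketch below assumes that each question is answered under three settings (Base: no retrieved context; Oracle: only relevant chunks; Mixed: relevant chunks plus noise) and that each question falls into exactly one of the four categories based on which settings produced a correct answer. The setting descriptions, the category rules, the `Outcome` type, and the function names are all assumptions made for this example, not the authors' implementation; the repository linked above is the place to check the actual definitions.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Outcome(NamedTuple):
    """Correctness of one question under each setting (hypothetical type)."""
    base: bool    # answered correctly with no context
    oracle: bool  # answered correctly with only relevant context
    mixed: bool   # answered correctly with relevant context plus noise


CATEGORIES = (
    "noise_vulnerability",
    "context_acceptability",
    "context_insensitivity",
    "context_misinterpretation",
)


def categorize(o: Outcome) -> str:
    """Assign one outcome to a category under the ASSUMED rules below."""
    if o.oracle:
        # Oracle context helped; did added noise break the answer?
        return "context_acceptability" if o.mixed else "noise_vulnerability"
    # Oracle context did not yield a correct answer.
    # If the model was right without context, the context misled it.
    return "context_misinterpretation" if o.base else "context_insensitivity"


def adaptability_rates(outcomes: Iterable[Outcome]) -> Dict[str, float]:
    """Return the fraction of questions in each category."""
    items = list(outcomes)
    if not items:
        raise ValueError("need at least one outcome")
    counts = Counter(categorize(o) for o in items)
    return {name: counts[name] / len(items) for name in CATEGORIES}


if __name__ == "__main__":
    sample = [
        Outcome(base=False, oracle=True, mixed=True),
        Outcome(base=False, oracle=True, mixed=False),
        Outcome(base=True, oracle=False, mixed=False),
        Outcome(base=False, oracle=False, mixed=False),
    ]
    for name, rate in adaptability_rates(sample).items():
        print(f"{name}: {rate:.2f}")
```

Under these assumed rules the four categories are mutually exclusive and exhaustive, so the rates sum to 1 and can be read as a breakdown of how a given retriever-LLM pair responds to context.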

Paper Structure

This paper contains 44 sections, 6 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Examples for four RAG Adaptability metrics. By analyzing model responses across three different settings, we assess the model's ability to utilize relevant information while disregarding irrelevant noise.
  • Figure 2: Data Filtering Process for MIRAGE
  • Figure 3: Number of data points per dataset
  • Figure 4: Number of data points per relevant chunks
  • Figure 5: The command-line screen used for the annotation process.