MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation

Chanhee Park; Hyeonseok Moon; Chanjun Park; Heuiseok Lim

TL;DR

MIRAGE addresses the challenge of evaluating retrieval-augmented generation (RAG) systems by providing a compact QA benchmark with a dedicated retrieval pool and four RAG adaptability metrics. The dataset (7,560 QA items linked to 37,800 chunks) and a multi-stage filtering pipeline enable precise, reproducible evaluation of both retrievers and LLMs under three context regimes (Base, Oracle, Mixed). Experiments across diverse retrievers and LLMs reveal nuanced interactions: noise in retrieved context can degrade answers, though model choice can mitigate the effect, and retrieval can lift smaller models toward the performance of larger ones. The data and code are publicly available, so researchers can analyze and optimize retriever-LLM pairings in practical settings.

Abstract

Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available at https://github.com/nlpai-lab/MIRAGE, allowing for seamless integration and customization in diverse research settings.
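
The abstract names the four adaptability metrics but does not define them here. As a rough illustration only, the sketch below assumes each QA item is scored as correct or incorrect under the three context regimes (Base: no context; Oracle: only the relevant chunk; Mixed: the relevant chunk plus distractors) and then placed in exactly one of four bins. The bin definitions, the `Outcome` record, and the `adaptability_scores` function are assumptions made for this example; they are not taken from the MIRAGE code, which should be consulted for the authoritative definitions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Outcome:
    """Correctness of one QA item under each regime (hypothetical record)."""
    base: bool    # answered with no retrieved context
    oracle: bool  # answered with only the relevant chunk
    mixed: bool   # answered with the relevant chunk plus distractors


def adaptability_scores(outcomes: List[Outcome]) -> Dict[str, float]:
    """Return the fraction of items falling into each of four assumed bins.

    Assumed bin definitions (not confirmed by the text above):
      noise_vulnerability       : correct with Oracle, wrong with Mixed
      context_acceptability     : correct with Oracle and with Mixed
      context_misinterpretation : correct with Base, wrong with Oracle
      context_insensitivity     : wrong with Base and with Oracle
    The bins are mutually exclusive, so the fractions sum to 1.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")

    counts = {
        "noise_vulnerability": 0,
        "context_acceptability": 0,
        "context_misinterpretation": 0,
        "context_insensitivity": 0,
    }
    for o in outcomes:
        if o.oracle and not o.mixed:
            counts["noise_vulnerability"] += 1
        elif o.oracle and o.mixed:
            counts["context_acceptability"] += 1
        elif o.base:  # Oracle was wrong although Base was right
            counts["context_misinterpretation"] += 1
        else:         # wrong both with and without context
            counts["context_insensitivity"] += 1

    total = len(outcomes)
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    # Toy outcomes, one per bin; these are not results from the paper.
    toy = [
        Outcome(base=False, oracle=True, mixed=True),
        Outcome(base=False, oracle=True, mixed=False),
        Outcome(base=True, oracle=False, mixed=False),
        Outcome(base=False, oracle=False, mixed=False),
    ]
    for name, value in adaptability_scores(toy).items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the four fractions partition the dataset, so a rise in one metric must be offset by a fall in another. This makes the scores easy to compare across retriever-LLM pairs.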
