Evaluating Retrieval Quality in Retrieval-Augmented Generation

Alireza Salemi, Hamed Zamani

TL;DR

Evaluating retrievers in retrieval-augmented generation (RAG) is challenging due to expensive end-to-end computation and poor interpretability of which retrieved documents contribute to the output. The authors propose eRAG, a per-document evaluation framework that uses the downstream task performance of the LLM on each candidate document to generate document-level relevance labels $G_q[d] = E_M(M(q, \{d\}), y)$ and aggregates these with set-based or ranking metrics to score retrieval lists. Across NQ, TriviaQA, HotpotQA, FEVER, and Wizard of Wikipedia, eRAG achieves higher correlation with downstream RAG performance, with Kendall's $\tau$ gains between $0.168$ and $0.494$, and demonstrates robustness to retrieval list size, LLM size, and augmentation strategy, while delivering substantial efficiency gains (approximately $2.47\times$ faster and up to $30$–$48\times$ memory savings) over end-to-end evaluation. The method offers a transparent, scalable alternative for retriever evaluation in RAG and is implemented with publicly available code for broader adoption.
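The per-document labeling and aggregation described above can be illustrated with a minimal Python sketch. It is not the authors' released code: `toy_generate` is a hypothetical stand-in for the LLM $M$, `exact_match` plays the role of the downstream metric $E_M$, and the average-precision variant normalizes by the number of relevant documents found in the list, which is a simplifying assumption.

```python
from typing import Callable, List, Sequence


def exact_match(prediction: str, target: str) -> float:
    """Downstream metric (stand-in for E_M): 1.0 on a case-insensitive match."""
    return float(prediction.strip().lower() == target.strip().lower())


def erag_labels(
    query: str,
    docs: Sequence[str],
    target: str,
    generate: Callable[[str, List[str]], str],
    metric: Callable[[str, str], float] = exact_match,
) -> List[float]:
    """Label each document d with metric(generate(query, [d]), target)."""
    return [metric(generate(query, [doc]), target) for doc in docs]


def precision_at_k(labels: Sequence[float], k: int) -> float:
    """Set-based aggregation: mean label over the top-k documents."""
    top = list(labels[:k])
    return sum(top) / len(top) if top else 0.0


def average_precision(labels: Sequence[float]) -> float:
    """Ranking aggregation over binary labels (normalized by hits in the list)."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def toy_generate(query: str, docs: List[str]) -> str:
    """Hypothetical LLM stand-in: answers 'Paris' only if the context mentions it."""
    context = " ".join(docs).lower()
    return "Paris" if "paris" in context else "unknown"


docs = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
    "France is a country in Europe.",
    "The Eiffel Tower is in Paris.",
]
labels = erag_labels("What is the capital of France?", docs, "Paris", toy_generate)
print(labels, precision_at_k(labels, 4), average_precision(labels))
```

Because each document is scored in isolation, the labels also show which retrieved documents help the generator, which a single end-to-end score does not reveal.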

Abstract

Evaluating retrieval-augmented generation (RAG) presents challenges, particularly for retrieval models within these systems. Traditional end-to-end evaluation methods are computationally expensive. Furthermore, evaluation of the retrieval model's performance based on query-document relevance labels shows a small correlation with the RAG system's downstream performance. We propose a novel evaluation approach, eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system. The output generated for each document is then evaluated based on the downstream task ground truth labels. In this manner, the downstream performance for each document serves as its relevance label. We employ various downstream task metrics to obtain document-level annotations and aggregate them using set-based or ranking metrics. Extensive experiments on a wide range of datasets demonstrate that eRAG achieves a higher correlation with downstream RAG performance compared to baseline methods, with improvements in Kendall's $τ$ correlation ranging from 0.168 to 0.494. Additionally, eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
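The reported agreement between a retrieval evaluation score and downstream performance is measured with Kendall's $τ$. A minimal sketch of that correlation follows, using a plain-Python tau-b (the tie-corrected variant); whether the paper uses this exact variant is an assumption, and the per-query scores below are made-up illustrative numbers, not results from the paper.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equally long score lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    # (pairs untied in y) * (pairs untied in x), as in the tau-b definition
    denom = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical per-query scores: a retrieval evaluation score for each query
# and the end-to-end downstream score of the RAG system on the same queries.
retrieval_scores = [0.5, 0.0, 1.0, 0.25]
end_to_end_scores = [1.0, 0.0, 1.0, 0.0]
tau = kendall_tau_b(retrieval_scores, end_to_end_scores)
print(f"Kendall's tau-b: {tau:.3f}")
```

A value near 1 means the evaluation score ranks queries almost the same way the end-to-end system does; a value near 0 means the score says little about downstream behavior.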

Paper Structure

This paper contains 6 sections, 1 equation, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: The correlation between evaluation approaches and the LLM's downstream performance as the number of documents retrieved by BM25 varies. T5-small with FiD is used. The metric with the highest correlation in the paper's correlation table for BM25 and Contriever (label tab:corr-bm25-contriever) is used.
  • Figure 2: The correlation between eRAG and the downstream performance of different LLM sizes. In this experiment, T5-small (60M parameters) and T5-base (220M parameters) with FiD are used. The documents are retrieved using BM25.
  • Figure 3: The correlation between eRAG and the downstream performance of FiD and IPA LLMs. T5-small with 10 documents retrieved by BM25 is used. The number of documents is chosen based on the limitations of the input size in IPA.