
Evaluation of Retrieval-Augmented Generation: A Survey

Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu

TL;DR

This paper addresses the challenge of evaluating Retrieval-Augmented Generation (RAG) systems, which combine retrieval and generation and rely on dynamic external knowledge. It introduces Auepora, a unified evaluation process that links evaluation targets, datasets, and metrics to systematically assess retrieval, generation, and the integrated RAG pipeline. The work analyzes existing benchmarks, identifies gaps, and proposes guidelines to create robust, domain-aware, and resource-conscious evaluations. By structuring evaluation around explicit EO-GT pairings and task-specific metrics, the framework aims to improve the reliability and comparability of RAG benchmarks in real-world settings.

Abstract

Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications are leveraging its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and reliance on dynamic knowledge sources. To better understand these challenges, we conduct A Unified Evaluation Process of RAG (Auepora) and aim to provide a comprehensive overview of the evaluation and benchmarks of RAG systems. Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness, within the current RAG benchmarks, encompassing the possible output and ground truth pairs. We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions to advance the field of RAG benchmarks.

Paper Structure

This paper contains 25 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The structure of the RAG system with retrieval and generation components and the corresponding four phases: indexing, search, prompting, and inferencing. The pairs of "Evaluable Outputs" (EOs) and "Ground Truths" (GTs) are highlighted in red and green frames, connected by brown dashed arrows.
  • Figure 2: The Target module of Auepora.