Evaluation of Retrieval-Augmented Generation: A Survey
Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu
TL;DR
This paper addresses the challenge of evaluating Retrieval-Augmented Generation (RAG) systems, which combine retrieval and generation components and rely on dynamic external knowledge. It introduces Auepora, A Unified Evaluation Process of RAG, which links evaluation targets, datasets, and metrics to systematically assess retrieval, generation, and the integrated RAG pipeline. The work analyzes existing benchmarks, identifies their gaps, and proposes guidelines for building robust, domain-aware, and resource-conscious evaluations. By structuring evaluation around explicit Evaluable Output-Ground Truth (EO-GT) pairings and task-specific metrics, the framework aims to improve the reliability and comparability of RAG benchmarks in real-world settings.
Abstract
Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing, and numerous studies and real-world applications now leverage its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and their reliance on dynamic knowledge sources. To better understand these challenges, we introduce A Unified Evaluation Process of RAG (Auepora), which aims to provide a comprehensive overview of the evaluation and benchmarking of RAG systems. Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness, within current RAG benchmarks, covering the possible pairs of evaluable outputs and ground truths. We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions for advancing the field of RAG benchmarks.
