Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey

Aoran Gan, Hao Yu, Kai Zhang, Qi Liu, Wenyu Yan, Zhenya Huang, Shiwei Tong, Guoping Hu

TL;DR

This survey delivers the first comprehensive review of Retrieval-Augmented Generation (RAG) evaluation in the era of Large Language Models, distinguishing internal component evaluation from external system-level assessment. It systematically surveys traditional IR/NLG metrics alongside emerging LLM-based evaluation methods, and aggregates datasets and frameworks to map the evaluation landscape. A meta-analysis of 582 papers reveals that internal evaluation remains dominant, while safety-focused external evaluation is underexplored and costly; it also shows a clear upward trend in LLM-driven judgments. The work provides practical guidance and resources to advance robust, scalable RAG evaluation and identifies key directions for future research.

Abstract

Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and generation components, as well as their dependence on dynamic knowledge sources in the LLM era. In response, this paper provides a comprehensive survey of RAG evaluation methods and frameworks, systematically reviewing traditional and emerging evaluation approaches for system performance, factual accuracy, safety, and computational efficiency in the LLM era. We also compile and categorize the RAG-specific datasets and evaluation frameworks, conducting a meta-analysis of evaluation practices in high-impact RAG research. To the best of our knowledge, this work represents the most comprehensive survey for RAG evaluation, bridging traditional and LLM-driven methods, and serves as a critical resource for advancing RAG development.

Paper Structure

This paper contains 24 sections, 29 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The workflow of the RAG system and component implementation in the LLM era.
  • Figure 2: The evaluation target of the Retrieval and Generation component in RAG.
  • Figure 3: Statistics on the distribution of RAG studies across four key areas: retrieval, generation, safety, and efficiency. A paper may utilize evaluation methods in more than one area.
  • Figure 4: Frequency statistics wordcloud of evaluation metrics in RAG studies. The LLM-based methods are categorized based on the targets and presented with the suffix '-LLM'. F-score refers to the expanded F1-score.
  • Figure 5: The number of papers explicitly mentioning LLM-based evaluation on RAG. The 2025 H1 collection covers papers up to March 31st.