RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Kunlun Zhu, Yifan Luo, Dingling Xu, Yukun Yan, Zhenghao Liu, Shi Yu, Ruobing Wang, Shuo Wang, Yishan Li, Nan Zhang, Xu Han, Zhiyuan Liu, Maosong Sun

TL;DR

RAGEval is a schema-driven framework that automatically generates scenario-specific RAG evaluation datasets, addressing two obstacles: limited scenario coverage and high annotation costs. It introduces three keypoint-based metrics computed against ground truth (Completeness, Hallucination, and Irrelevance) alongside the retrieval metrics Recall and EIR (Effective Information Rate), so that both retrieval and generation quality are assessed. On the DragonBall benchmark, which spans finance, law, and medicine in Chinese (CN) and English (EN), RAGEval-produced data and metrics yield robust, human-aligned evaluations; GPT-4o often leads in Completeness, while open models close the gap in certain settings. The framework offers a practical path to real-world, domain-specific RAG evaluation and a foundation for broader multilingual, scenario-rich assessment of RAG systems.

Abstract

Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high cost of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of the clarity, safety, conformity, and richness of generated samples. Furthermore, using LLMs to score the proposed metrics yields a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.
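
The abstract names the three keypoint-based metrics without defining them. As a rough illustration only, the sketch below assumes that each ground-truth keypoint has already been judged (by a human or an LLM) as covered by the response, contradicted by it, or absent from it, and that Completeness, Hallucination, and Irrelevance are the respective fractions of keypoints in each group. The label names, the `KeypointScores` type, and the `score_keypoints` function are hypothetical and are not taken from the RAGEval code base.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical labels a judge (human or LLM) assigns to each ground-truth keypoint.
COVERED = "covered"            # the response states the keypoint correctly
CONTRADICTED = "contradicted"  # the response conflicts with the keypoint
MISSING = "missing"            # the response neither covers nor contradicts it

VALID_LABELS = {COVERED, CONTRADICTED, MISSING}


@dataclass(frozen=True)
class KeypointScores:
    completeness: float
    hallucination: float
    irrelevance: float


def score_keypoints(labels: Sequence[str]) -> KeypointScores:
    """Turn per-keypoint judgments into three fractions (assumed definitions)."""
    if not labels:
        raise ValueError("at least one keypoint label is required")
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    total = len(labels)
    counts = {label: 0 for label in VALID_LABELS}
    for label in labels:
        counts[label] += 1

    return KeypointScores(
        completeness=counts[COVERED] / total,
        hallucination=counts[CONTRADICTED] / total,
        irrelevance=counts[MISSING] / total,
    )


if __name__ == "__main__":
    # Four keypoints: two covered, one contradicted, one absent.
    print(score_keypoints([COVERED, COVERED, CONTRADICTED, MISSING]))
```

Under these assumed definitions every keypoint falls into exactly one group, so the three scores sum to 1 for any response.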

Paper Structure (35 sections, 5 equations, 15 figures, 13 tables)

Figures (15)

  • Figure 1: The challenges of building scenario-specific RAG evaluation datasets: scenario coverage and annotation costs.
  • Figure 2: The RAGEval process. ➀ Summarizing a schema containing specific knowledge from seed documents. ➁ Filling in factual information based on this schema to generate diverse configurations. ➂ Generating documents according to the configurations. ➃ Creating evaluation data, composed of questions, answers, and references, derived from the configurations and documents.
  • Figure 3: Completeness (%) for different query types under different Chunk-TopK settings in the finance scenario of the English dataset. Three query types are tested: Factual Question (FQ), Multi-hop Reasoning Question (MRQ), and Numerical Comparison Question (NCQ).
  • Figure 4: QRA quality scoring criteria.
  • Figure 5: Document quality comparison criteria.
  • ...and 10 more figures