CRAG -- Comprehensive RAG Benchmark

Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

TL;DR

CRAG presents a comprehensive RAG benchmark with 4,409 QA pairs across five domains and eight question types, supplemented by mock KG data and extensive web retrieval content to simulate real-world information access. It defines three retrieval-augmented QA tasks and introduces a truthfulness-focused evaluation that penalizes hallucinations, enabling reliable cross-system comparisons. Empirical results show that while straightforward RAG improves over LLM-only baselines, gaps remain in dynamic, tail, and complex questions, and even SOTA industry solutions struggle with hallucination and latency. By providing detailed evaluation protocols and scalable content, CRAG offers a practical platform to advance trustworthy RAG and QA research, with plans for future extensions and maintenance.

Abstract

Retrieval-Augmented Generation (RAG) has recently emerged as a promising way to address the knowledge gaps of Large Language Models (LLMs). Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamism ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge and attracted thousands of participants and submissions. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions. CRAG is available at https://github.com/facebookresearch/CRAG/.
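
To illustrate how a truthfulness-focused score can penalize hallucinations rather than only count correct answers, here is a minimal Python sketch. It assumes a three-way labeling of answers (accurate, missing, hallucinated) with weights of +1, 0, and -1. The label names, the weights, and the function name `truthfulness_score` are illustrative assumptions and are not stated in this summary.

```python
from collections import Counter

# Assumed per-answer weights (illustrative, not taken from this summary):
# reward correct answers, give no credit for abstaining ("missing"),
# and penalize hallucinated answers.
WEIGHTS = {"accurate": 1.0, "missing": 0.0, "hallucinated": -1.0}


def truthfulness_score(labels):
    """Return the mean weight over a list of per-answer labels."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    unknown = set(counts) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(WEIGHTS[label] * n for label, n in counts.items())
    return total / len(labels)


# Hypothetical example: 5 accurate, 3 missing, 2 hallucinated answers.
labels = ["accurate"] * 5 + ["missing"] * 3 + ["hallucinated"] * 2
score = truthfulness_score(labels)  # (5 - 2) / 10
```

Under a scheme like this, a system that abstains when unsure scores higher than one that guesses and gets the answer wrong, which is the behavior a hallucination-penalizing metric is meant to reward.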

Paper Structure (31 sections, 1 equation, 4 figures, 11 tables)

Figures (4)

  • Figure 1: QA using LLMs (a) without RAG vs. (b) with RAG.
  • Figure 2: For $85\%$ of CRAG questions, the web search results are estimated to contain the ground truth facts. The curve shows that the retrieval recall grows sharply at the beginning and flattens out later on.
  • Figure 3: LLM-only and Task 3 solution auto-eval truthfulness (in percentage) across domain, dynamism, popularity, and question type.
  • Figure 4: SOTA systems human-eval truthfulness scores (in percentage) across different dimensions.