Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

Gabrielle Kaili-May Liu, Bryan Li, Arman Cohan, William Gantt Walden, Eugene Yang

TL;DR

The paper addresses a critical gap in evaluating retrieval-augmented generation (RAG) by introducing a pipeline that automatically generates CRUMQs: uncheatable, realistic, unanswerable, multi-hop queries tailored to any corpus. The method combines topic extraction, external-source augmentation, controlled multi-document contexts, seed query generation, unanswerability verification, and CoT-based hop validation to produce high-quality benchmarks. Experiments on NeuCLIR and TREC RAG 2025 show that CRUMQs are significantly more challenging and less cheatable than prior benchmarks, with up to an 81% reduction in cheatability scores. This work enables controllable benchmark difficulty, enhances realism in RAG evaluation, and provides a foundation for developing more capable, reliable RAG systems in real-world settings.

Abstract

Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop $\underline{q}$uerie$\underline{s}$ (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that, compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to an 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.

Paper Structure

This paper contains 8 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of the CRUMQs generation pipeline.
  • Figure 2: Acceptable ratios of RAG systems on CRUMQs across hop counts. Performance drops with more hops.
  • Figure 3: DiRe F1 score ratios $(\downarrow)$ across benchmarks and tasks. Black points denote accuracy per model per task (values on right axis).