Controlled Retrieval-augmented Context Evaluation for Long-form RAG

Jia-Huei Ju, Suzan Verberne, Maarten de Rijke, Andrew Yates

TL;DR

CRUX introduces a controlled, content-focused evaluation framework for retrieval in long-form RAG by leveraging human-written multi-document summaries to define an oracle knowledge scope. It defines Cov, Den, and a novelty-aware ranked coverage metric to assess the retrieval context and its impact on final long-form results, independent of generation quality. Empirical results show substantial gaps between state-of-the-art retrieval pipelines and oracle retrieval, with CRUX aligning well with human judgments and better predicting final coverage than traditional ranking metrics. The framework provides a scalable, diagnostic tool and release-ready data/code to steer development of retrieval methods tailored for long-form RAG tasks.

Abstract

Retrieval-augmented generation (RAG) enhances large language models by incorporating context retrieved from external knowledge sources. While the effectiveness of the retrieval module is typically evaluated with relevance-based ranking metrics, such metrics may be insufficient to reflect the retrieval's impact on the final RAG result, especially in long-form generation scenarios. We argue that providing a comprehensive retrieval-augmented context is important for long-form RAG tasks like report generation and propose metrics for assessing the context independent of generation. We introduce CRUX, a Controlled Retrieval-aUgmented conteXt evaluation framework designed to directly assess retrieval-augmented contexts. This framework uses human-written summaries to control the information scope of knowledge, enabling us to measure how well the context covers information essential for long-form generation. CRUX uses question-based evaluation to assess RAG's retrieval in a fine-grained manner. Empirical results show that CRUX offers more reflective and diagnostic evaluation. Our findings also reveal substantial room for improvement in current retrieval methods, pointing to promising directions for advancing RAG's retrieval. Our data and code are publicly available to support and advance future research on retrieval.
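
To make the question-based idea concrete, the following is a minimal sketch of a coverage-style score, not the authors' implementation. It assumes that a judge (for example, an LLM) has already produced a graded answerability score for each retrieved passage and each sub-question, and that a sub-question counts as covered when at least one passage in the context reaches a threshold. The function name `coverage`, the matrix layout, and the default threshold are all hypothetical choices made for illustration.

```python
from typing import Sequence


def coverage(answerability: Sequence[Sequence[int]], threshold: int = 1) -> float:
    """Fraction of sub-questions answerable from a retrieval context.

    answerability[i][j] is an assumed graded score stating how well
    passage i answers sub-question j (higher means more answerable).
    A sub-question is treated as covered if any passage scores at or
    above `threshold`. This is an illustrative assumption, not the
    paper's exact definition.
    """
    if not answerability:
        return 0.0
    num_questions = len(answerability[0])
    if num_questions == 0:
        return 0.0
    covered = sum(
        1
        for j in range(num_questions)
        if any(row[j] >= threshold for row in answerability)
    )
    return covered / num_questions


# Two passages judged against three sub-questions: the first two
# sub-questions are answerable from the context, the third is not.
scores = [
    [1, 0, 0],
    [1, 1, 0],
]
print(coverage(scores))
```

Because the score depends only on the retrieved text and the sub-questions, it can be computed before any generation step, which is the sense in which the context is assessed independently of generation.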

Paper Structure

This paper contains 50 sections, 5 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: An example of long-form generation with an open-ended query $x$ and a desired response $y$. The underlined text marks relevant content in the retrieval that contributes to the final result. By directly assessing the retrieval context $Z$, we can further explicitly identify incomplete and redundant retrieval.
  • Figure 2: The controlled data generation derived from multi-document summarization datasets.
  • Figure 3: CRUX employs sub-question answerability to directly assess the textual content of both the retrieval context $Z$ and its corresponding RAG result $y$. The metrics include coverage and density.
  • Figure 4: Coverage of RAG results for 10 CRUX-DUC queries ($x$-axis) under three retrieval contexts ($y$-axis). Each subplot shows LLM-judged coverage (line) and human judgments (markers); bars indicate the annotators' average. The Spearman correlations $\rho$ are computed between the LLM and each annotator's coverage.
  • Figure 5: Kendall $\tau$ rank correlations between evaluation metrics on CRUX-DUC, using 48 randomly sampled retrieval contexts $Z$. Metrics include intermediate and final coverage, and other relevance-based metrics.
  • ...and 5 more figures