The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models

Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin

TL;DR

This work tackles the challenge of evaluating long-form RAG outputs by adopting the nugget-based framework from the TREC QA Track and retooling it for LLM-driven automation. It introduces AutoNuggetizer, which uses LLMs to automatically create information nuggets and assign them to system answers, and validates the resulting automatic scores against human annotations from the TREC 2024 RAG Track. The study shows strong run-level agreement between fully automatic nugget evaluation and human-based variants, with even stronger agreement when only nugget assignment is automated. Per-topic agreement is more variable, so the results point to a scalable, cost-efficient evaluation pathway for RAG systems while leaving per-topic reliability and the calibration of LLM-based judgments as open problems for diagnostic use.

Abstract

Large Language Models (LLMs) have significantly enhanced the capabilities of information access systems, especially with retrieval-augmented generation (RAG). Nevertheless, the evaluation of RAG systems remains a barrier to continued progress, a challenge we tackle in this work by proposing an automatic evaluation framework that is validated against human annotations. We believe that the nugget evaluation methodology provides a solid foundation for evaluating RAG systems. This approach, originally developed for the TREC Question Answering (QA) Track in 2003, evaluates systems based on atomic facts that should be present in good answers. Our efforts focus on "refactoring" this methodology, where we describe the AutoNuggetizer framework that specifically applies LLMs to both automatically create nuggets and automatically assign nuggets to system answers. In the context of the TREC 2024 RAG Track, we calibrate a fully automatic approach against strategies where nuggets are created manually or semi-manually by human assessors and then assigned manually to system answers. Based on results from a community-wide evaluation, we observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants. The agreement is stronger when individual framework components such as nugget assignment are automated independently. This suggests that our evaluation framework provides tradeoffs between effort and quality that can be used to guide the development of future RAG systems. However, further research is necessary to refine our approach, particularly in establishing robust per-topic agreement to diagnose system failures effectively.
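
To make the nugget idea concrete, the following is a minimal sketch of how an answer could be scored once nuggets have been created and assigned. The label sets ("vital"/"okay" for importance, "support"/"partial_support"/"not_support" for assignment), the `Nugget` class, and the `strict_recall` function are illustrative assumptions, not the AutoNuggetizer implementation; the strict score here is assumed to be the fraction of nuggets that the answer fully supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nugget:
    text: str
    importance: str  # assumed label set: "vital" or "okay"


def strict_recall(nuggets, assignments, vital_only=False):
    """Fraction of nuggets labelled 'support' (assumed strict scoring rule).

    assignments maps nugget text to an assumed label:
    "support", "partial_support", or "not_support".
    """
    pool = [n for n in nuggets if not vital_only or n.importance == "vital"]
    if not pool:
        return 0.0
    hits = sum(1 for n in pool if assignments.get(n.text) == "support")
    return hits / len(pool)


# Toy data: four hypothetical nuggets for one topic and one system answer.
nuggets = [
    Nugget("fact A", "vital"),
    Nugget("fact B", "vital"),
    Nugget("fact C", "okay"),
    Nugget("fact D", "okay"),
]
assignments = {
    "fact A": "support",
    "fact B": "partial_support",
    "fact C": "not_support",
    "fact D": "not_support",
}

all_strict = strict_recall(nuggets, assignments)                     # 1 of 4 nuggets
vital_strict = strict_recall(nuggets, assignments, vital_only=True)  # 1 of 2 vital nuggets
```

In this sketch, partial support counts as a miss, which is what makes the score "strict"; a lenient variant would need its own weighting, which is not specified here.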

Paper Structure

This paper contains 27 sections, 3 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Prompt for the iterative nuggetization at turn $i$.
  • Figure 2: Prompt for determining the importance of nuggets. At each turn, at most 10 nuggets are passed to the LLM.
  • Figure 3: Prompt for nugget assignment. At each turn, at most 10 nuggets are passed to the LLM.
  • Figure 4: Scatter plots between manual vs. automatic $V_{\textrm{strict}}$ and $A_{\textrm{strict}}$ scores for AG and RAG runs. The $x$ axes show scores from Auto-Nuggets / Auto-Assign and the $y$ axes show scores from different manual conditions. Red circles (RAG runs) and orange squares (AG runs) represent run-level scores. Blue circles (RAG runs) and purple squares (AG runs) show all topic/run combinations. The bottom-right box reports Kendall's $\tau$ correlations computed three ways: at the run level (red circles/orange squares), over all topic/run combinations (blue circles/purple squares), and as the average of per-topic correlations.
  • Figure 5: Scatter plots showing correlations designed to answer RQ2, isolating the effects of nugget assignment. Overall organization is identical to Figure 4.
  • ...and 1 more figure
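
The Figure 4 caption mentions three ways of computing Kendall's $\tau$ between automatic and manual scores. The sketch below illustrates the difference on toy data. The scores, run names, and topic names are invented for illustration, and the hand-written `kendall_tau` is the tau-a variant (the source does not say which variant is used; with no ties, tau-a and tau-b coincide).

```python
from itertools import combinations
from statistics import mean


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    if not pairs:
        return 0.0
    total = 0
    for i, j in pairs:
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        total += (s > 0) - (s < 0)  # +1 concordant, -1 discordant, 0 tie
    return total / len(pairs)


# Hypothetical scores[run][topic] under an automatic and a manual condition.
auto = {
    "run1": {"t1": 0.2, "t2": 0.6},
    "run2": {"t1": 0.4, "t2": 0.5},
    "run3": {"t1": 0.7, "t2": 0.9},
}
manual = {
    "run1": {"t1": 0.3, "t2": 0.5},
    "run2": {"t1": 0.2, "t2": 0.7},
    "run3": {"t1": 0.6, "t2": 0.8},
}
runs = sorted(auto)
topics = sorted(auto[runs[0]])

# 1) Run level: average each run over topics, then correlate the run means.
run_level = kendall_tau(
    [mean(auto[r].values()) for r in runs],
    [mean(manual[r].values()) for r in runs],
)

# 2) All topic/run combinations: pool every (run, topic) score pair.
pooled = kendall_tau(
    [auto[r][t] for r in runs for t in topics],
    [manual[r][t] for r in runs for t in topics],
)

# 3) Per topic: correlate runs within each topic, then average the taus.
per_topic = mean(
    kendall_tau([auto[r][t] for r in runs], [manual[r][t] for r in runs])
    for t in topics
)
```

On this toy data the run-level correlation is 1.0, the pooled correlation is 0.6, and the per-topic average is about 0.33. Averaging over topics smooths out per-topic disagreements, which is one way run-level agreement can look strong while per-topic agreement remains weak.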