A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look

Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, Jimmy Lin

TL;DR

This study evaluates four in-situ relevance-assessment pipelines, including a fully automated LLM-based approach via UMBRELA, against the traditional NIST manual judgments in the TREC 2024 RAG Track. Using Kendall's $\tau$ to compare system rankings across $nDCG@20$, $nDCG@100$, and $Recall@100$, it analyzes 77 runs over 301 topics to quantify cost–quality tradeoffs. The key finding is that automatically generated UMBRELA judgments correlate highly with fully manual judgments at the run level, while added human-in-the-loop steps do not yield additional benefits; human assessors generally apply stricter relevance criteria. The results validate LLM-based relevance assessments in academic IR meta-evaluation and establish a scalable framework for future evaluations, highlighting both potential savings and limitations of LLM-driven labeling.

Abstract

The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits. Overall, human assessors appear to be stricter than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
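To make the run-level comparison concrete, the following is a minimal sketch of how Kendall's $\tau$ between two system rankings could be computed. It assumes each assessment approach yields one mean score per run (for example, mean nDCG@20); the function name `kendall_tau`, the run identifiers, and the scores are all hypothetical and are not taken from the paper. The sketch uses the simple tau-a variant, in which tied pairs count as neither concordant nor discordant, which may differ from the variant the authors used.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {run_id: score} mappings.

    Only runs present in both mappings are compared. Tied pairs
    contribute to neither the concordant nor the discordant count.
    """
    runs = sorted(set(scores_a) & set(scores_b))
    if len(runs) < 2:
        raise ValueError("need at least two shared runs")
    concordant = discordant = 0
    for r1, r2 in combinations(runs, 2):
        # Sign of the product tells whether both judgment sets
        # order this pair of runs the same way.
        product = (scores_a[r1] - scores_a[r2]) * (scores_b[r1] - scores_b[r2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical run-level scores under two sets of judgments.
manual_scores = {"run_a": 0.62, "run_b": 0.55, "run_c": 0.48, "run_d": 0.40}
umbrela_scores = {"run_a": 0.71, "run_b": 0.60, "run_c": 0.63, "run_d": 0.52}

tau = kendall_tau(manual_scores, umbrela_scores)
print(f"Kendall's tau = {tau:.3f}")
```

In this toy data the two judgment sets disagree only on the order of `run_b` and `run_c`, so five of the six run pairs are concordant. Absolute scores can differ between the two sets without affecting $\tau$, since only the ordering of runs matters.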

Paper Structure

This paper contains 15 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: The first five topics from the TREC 2024 RAG Track.
  • Figure 2: The prompt utilized with UMBRELA for relevance assessment.
  • Figure 3: Comparisons between UMBRELA scores and scores from fully manual (top row), manual with filtering (middle row), and manual with post-editing (bottom row). Columns show different metrics: nDCG@20, nDCG@100, and Recall@100. In each scatter plot, red dots show run-level scores and the blue dots show all topic/run combinations. Each scatter plot is annotated with rank correlations in terms of Kendall's $\tau$. This analysis is performed on common (i.e., overlapping) topics.
  • Figure 4: Run-level rank correlations (Kendall's $\tau$) comparing manual with filtering vs. fully manual (top row) and manual with post-editing vs. fully manual (bottom row). Columns show different metrics: nDCG@20, nDCG@100, and Recall@100. Note that topics are disjoint in this analysis.
  • Figure 5: Confusion matrices comparing UMBRELA with the fully manual process (left), manual with filtering (middle), and manual with post-editing (right).
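As an illustration of the kind of comparison summarized in Figure 5, here is a minimal sketch of building a confusion matrix from two sets of relevance labels. It assumes labels are integer grades in the range 0 to `num_grades - 1` and are keyed by (topic, passage) pairs; the function name, the keys, the number of grades, and the labels themselves are hypothetical and not taken from the paper.

```python
def confusion_matrix(human_labels, llm_labels, num_grades):
    """Count label pairs: rows are human grades, columns are LLM grades.

    Only (topic, passage) keys judged by both sources are counted.
    """
    matrix = [[0] * num_grades for _ in range(num_grades)]
    for key in sorted(set(human_labels) & set(llm_labels)):
        matrix[human_labels[key]][llm_labels[key]] += 1
    return matrix


# Hypothetical judgments on an assumed four-grade scale (0 to 3).
human = {("t1", "d1"): 0, ("t1", "d2"): 1, ("t2", "d1"): 2, ("t2", "d3"): 0}
llm = {("t1", "d1"): 1, ("t1", "d2"): 1, ("t2", "d1"): 3, ("t2", "d3"): 0}

matrix = confusion_matrix(human, llm, num_grades=4)

# Cells above the diagonal: the LLM assigned a higher grade than the human.
llm_higher = sum(matrix[h][l] for h in range(4) for l in range(4) if l > h)
# Cells below the diagonal: the human assigned a higher grade than the LLM.
llm_lower = sum(matrix[h][l] for h in range(4) for l in range(4) if l < h)

for row in matrix:
    print(row)
print("LLM higher:", llm_higher, "LLM lower:", llm_lower)
```

Mass concentrated above the diagonal would be consistent with human assessors applying relevance criteria more strictly than the automatic labels, although the toy numbers here say nothing about the actual distribution in the study.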