A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look
Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, Jimmy Lin
TL;DR
This study evaluates four in-situ relevance-assessment pipelines, including a fully automated LLM-based approach via UMBRELA, against the traditional NIST manual judgments in the TREC 2024 RAG Track. Using Kendall's $\tau$ to compare system rankings across $nDCG@20$, $nDCG@100$, and $Recall@100$, it analyzes 77 runs over 301 topics to quantify cost–quality tradeoffs. The key finding is that automatically generated UMBRELA judgments correlate highly with fully manual judgments at the run level, while added human-in-the-loop steps do not yield additional benefits; human assessors generally apply stricter relevance criteria. The results validate LLM-based relevance assessments in academic IR meta-evaluation and establish a scalable framework for future evaluations, highlighting both potential savings and limitations of LLM-driven labeling.
Abstract
The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits. Overall, human assessors appear to be stricter than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
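To make the idea of correlating system rankings concrete, the following is a minimal sketch of a rank correlation between two sets of per-run scores, for example nDCG@20 computed once from manual judgments and once from automatically generated ones. It uses the simple tau-a form of Kendall's tau, in which tied pairs count as neither concordant nor discordant; the abstract does not say which variant the study uses, so that choice is an assumption. The function name `kendall_tau`, the run identifiers, and all score values are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {run_id: score} dicts (illustrative only).

    Only runs present in both dicts are compared. A pair of runs is
    concordant when both score sets order it the same way, discordant
    when they order it in opposite ways, and ignored when either set ties.
    """
    runs = sorted(set(scores_a) & set(scores_b))
    if len(runs) < 2:
        raise ValueError("need at least two shared runs")

    concordant = 0
    discordant = 0
    for r1, r2 in combinations(runs, 2):
        diff_a = scores_a[r1] - scores_a[r2]
        diff_b = scores_b[r1] - scores_b[r2]
        product = diff_a * diff_b
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-run scores under two assessment approaches.
manual_scores = {"run_a": 0.52, "run_b": 0.47, "run_c": 0.40, "run_d": 0.33}
automatic_scores = {"run_a": 0.61, "run_b": 0.63, "run_c": 0.50, "run_d": 0.41}

tau = kendall_tau(manual_scores, automatic_scores)
print(f"Kendall's tau = {tau:.3f}")
```

In this toy data the two approaches disagree only on the order of `run_a` and `run_b`, so five of the six run pairs are concordant and one is discordant. Note also that the hypothetical automatic scores are uniformly higher than the manual ones, yet the correlation stays high: a rank correlation measures agreement on ordering, not on absolute score levels.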
