UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor

Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, Jimmy Lin

TL;DR

This work tackles the high cost of manual relevance labeling by presenting UMBRELA, an open-source replication of the Bing RELevance Assessor that uses GPT-4o with DNA prompting to generate 0–3 relevance judgments for query-passage pairs. Across the TREC Deep Learning tracks 2019–2023, LLM-based judgments show meaningful correlations with human judgments and preserve system rankings, even after near-duplicate passages are removed. The toolkit is designed for easy integration into existing multi-stage retrieval and evaluation pipelines and is slated for use in the TREC 2024 RAG Track, positioning it as a practical foundation for future retrieval-evaluation research. Overall, the paper argues that LLMs can provide high-quality, scalable relevance judgments that accelerate and augment IR evaluation while staying aligned with human assessors.

Abstract

Copious amounts of relevance judgments are necessary for the effective training and accurate evaluation of retrieval systems. Conventionally, these judgments are made by human assessors, rendering this process expensive and laborious. A recent study by Thomas et al. from Microsoft Bing suggested that large language models (LLMs) can accurately perform the relevance assessment task and provide human-quality judgments, but unfortunately their study did not yield any reusable software artifacts. Our work presents UMBRELA (a recursive acronym that stands for UMbrela is the Bing RELevance Assessor), an open-source toolkit that reproduces the results of Thomas et al. using OpenAI's GPT-4o model and adds more nuance to the original paper. Across Deep Learning Tracks from TREC 2019 to 2023, we find that LLM-derived relevance judgments correlate highly with rankings generated by effective multi-stage retrieval systems. Our toolkit is designed to be easily extensible and can be integrated into existing multi-stage retrieval and evaluation pipelines, offering researchers a valuable resource for studying retrieval evaluation methodologies. UMBRELA will be used in the TREC 2024 RAG Track to aid in relevance assessments, and we envision our toolkit becoming a foundation for further innovation in the field. UMBRELA is available at https://github.com/castorini/umbrela.

Paper Structure (11 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Prompt used for relevance assessment.
  • Figure 2: Confusion matrices comparing the original human labels with those generated by the LLM.
  • Figure 3: Scatter plots for comparing evaluations performed using original human assessments and LLM assessments.
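
The comparisons behind Figures 2 and 3 can be illustrated with a minimal sketch. It is not taken from the UMBRELA toolkit: the function names, the toy labels, and the per-system scores below are all hypothetical, and Kendall's tau-a is used here only as one common way to measure rank agreement between two evaluations. The first function tallies a 4x4 confusion matrix over the 0–3 relevance scale; the second measures how well two sets of system scores agree on the ordering of systems.

```python
from itertools import combinations

LABELS = (0, 1, 2, 3)  # the 0-3 relevance scale described above


def confusion_matrix(human, llm, labels=LABELS):
    """Count label pairs: rows are human labels, columns are LLM labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for h, m in zip(human, llm):
        matrix[index[h]][index[m]] += 1
    return matrix


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical toy data: one label per query-passage pair.
human_labels = [0, 1, 2, 3, 0, 2, 3, 1]
llm_labels = [0, 1, 1, 3, 1, 2, 3, 0]
matrix = confusion_matrix(human_labels, llm_labels)

# Hypothetical effectiveness scores for four systems, computed once with
# human judgments and once with LLM judgments.
human_scores = [0.71, 0.65, 0.58, 0.49]
llm_scores = [0.69, 0.66, 0.52, 0.55]
tau = kendall_tau(human_scores, llm_scores)

for row in matrix:
    print(row)
print(f"Kendall tau-a between the two system rankings: {tau:.3f}")
```

A tau close to 1 means that both sets of judgments order the systems almost identically, which is the sense in which LLM judgments are said to preserve system rankings.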