Don't Use LLMs to Make Relevance Judgments

Ian Soboroff

TL;DR

This paper argues against using large language models (LLMs) to generate ground-truth relevance judgments for TREC-style evaluations, because such judgments effectively set a performance ceiling tied to the generating model. It formalizes retrieval and evaluation as a single prediction problem with an ideal ranking that maximizes the evaluation metric, showing that ground truth generated by an LLM cannot exceed the model's capabilities and can mask future improvements. The author surveys automatic evaluation, relevance-feedback-based learning, and existing LLM-based relevance predictions, highlighting how each approach either fails to provide unbiased truth data or introduces model-based biases. While discouraging LLM-generated ground truth, the paper notes potential non-ground-truth applications for LLMs, such as quality control and analysis support, and emphasizes the fundamental limits that the need for relevance judgments places on IR evaluation.

Abstract

Making the relevance judgments for a TREC-style test collection can be complex and expensive. A typical TREC track involves a team of six contractors working for 2-4 weeks. Those contractors need to be trained and monitored. Software has to be written to support recording relevance judgments correctly and efficiently. The recent advent of large language models that produce astoundingly human-like flowing text output in response to a natural language prompt has inspired IR researchers to wonder how those models might be used in the relevance judgment collection process. At the ACM SIGIR 2024 conference, a workshop, "LLM4Eval", provided a venue for this work and featured a data challenge activity where participants reproduced TREC deep learning track judgments, as was done by Thomas et al. (arXiv:2408.08896, arXiv:2309.10621). I was asked to give a keynote at the workshop, and this paper presents that keynote in article form. The bottom-line-up-front message is: don't use LLMs to create relevance judgments for TREC-style evaluations.

Paper Structure

This paper contains 9 sections, 1 theorem, 4 equations, and 1 figure.

Key Result

Theorem 1

Let $C$ be a test collection $(D, S, R)$ where $R: s \mapsto d$ maps search needs to relevant documents $\{+,+,+, \ldots\}$. Let $A(s, D)$ be a ranking function that produces a ranking of documents $\{d_n \in D\}$ for a search need $s$. Let $E: A(s, D), R \mapsto \mathbb{R}$ be an evaluation metric that is maximized by the ranking that places the relevant documents ahead of any irrelevant documents. The ideal ranking
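
The idea behind Theorem 1 can be checked on a toy example. The sketch below is illustrative only and is not taken from the paper: it assumes average precision as the metric $E$, uses made-up document ids, and introduces a hypothetical set `model_judgments` to stand in for truth data produced by a model that disagrees with the human assessor on two documents. It first confirms by brute force that no ranking of this small collection scores higher than the one that puts all relevant documents first, then scores that same ranking against the model-generated judgments.

```python
from itertools import permutations

def average_precision(ranking, relevant):
    """Average precision of `ranking` given the set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant doc
    return precision_sum / len(relevant)

def ideal_ranking(docs, relevant):
    """Place every relevant document ahead of every irrelevant one."""
    # False sorts before True, so relevant docs come first; sorted() is stable.
    return sorted(docs, key=lambda d: d not in relevant)

# Toy collection (hypothetical ids) and human relevance judgments.
docs = ["d1", "d2", "d3", "d4", "d5"]
human_judgments = {"d2", "d5"}

ideal = ideal_ranking(docs, human_judgments)
ideal_score = average_precision(ideal, human_judgments)

# Brute force over all 120 rankings: none beats the relevant-first ranking.
best_score = max(average_precision(p, human_judgments) for p in permutations(docs))

# Hypothetical model-generated truth: misses d5 and wrongly accepts d3.
model_judgments = {"d2", "d3"}
score_under_model = average_precision(ideal, model_judgments)
model_own_score = average_precision(ideal_ranking(docs, model_judgments), model_judgments)

print(ideal, ideal_score, best_score, score_under_model, model_own_score)
```

Under these assumptions, the ranking that is ideal with respect to the human judgments scores below the model's own ranking once the model's judgments are used as truth, which is the ceiling effect the paper warns about: a system that does better than the judging model is measured as doing worse.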

Figures (1)

  • Figure 1: Sample result from Soboroff2001, TREC-8, TREC-style pooling to depth 100.

Theorems & Definitions (1)

  • Theorem 1: ideal rankings