Can We Use Large Language Models to Fill Relevance Judgment Holes?

Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, Mohammad Aliannejadi

TL;DR

This work tackles the challenge of incomplete relevance judgments in reusable IR test collections by proposing grounded LLM-based annotations to extend evaluation pools, with a focus on Conversational Search in TREC iKAT. It systematically compares commercial ChatGPT and open-source LLaMA models under zero/one/two-shot prompts and PEFT-based fine-tuning, assessing both label agreement and model ranking against human judgments. Key findings show that while LLM-generated pools can yield rankings highly correlated with human-based rankings, binary/graded label agreement can be substantially lower, and the presence of holes can bias new-model evaluations. The authors conclude that generating the entire pool with LLM annotations tends to produce more consistent, ground-truth-aligned comparisons, and they highlight prompting design and grounding as crucial directions for future work to improve alignment with human judgments.

Abstract

Incomplete relevance judgments limit the reusability of test collections. When new systems are compared against the previous systems used to build the pool of judged documents, they are often at a disadvantage due to the "holes" in the test collection (i.e., pockets of unassessed documents returned by the new system). In this paper, we take initial steps towards extending existing test collections by employing Large Language Models (LLMs) to fill the holes, leveraging existing human judgments to ground the method. We explore this problem in the context of Conversational Search using TREC iKAT, where information needs are highly dynamic and the responses (and the results retrieved) are much more varied, leaving bigger holes. While previous work has shown that automatic judgments from LLMs result in highly correlated rankings, we find substantially lower correlations when human plus automatic judgments are used (regardless of the LLM, one/two/few-shot prompting, or fine-tuning). We further find that, depending on the LLM employed, new runs will be highly favored (or penalized), and this effect is magnified proportionally to the size of the holes. Instead, one should generate the LLM annotations on the whole document pool to achieve more consistent rankings with human-generated labels. Future work is required on prompt engineering and fine-tuning of LLMs so that they reflect and represent the human annotations, grounding and aligning the models such that they are more fit for purpose.
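
As a rough illustration of what hole filling means in practice, the sketch below extends a set of human judgments with automatic labels for the unjudged documents a new run returns. It is not the authors' implementation: the names `fill_holes`, `precision_at_k`, and `stub_llm_judge` are hypothetical, the qrels layout is an assumed simplification, and the LLM call is replaced by a stub.

```python
from typing import Callable, Dict, List, Tuple

# Assumed layout: (topic_id, doc_id) -> relevance grade.
Qrels = Dict[Tuple[str, str], int]
Run = Dict[str, List[str]]  # topic_id -> ranked list of doc_ids


def fill_holes(human_qrels: Qrels, run: Run,
               llm_judge: Callable[[str, str], int], depth: int = 10) -> Qrels:
    """Copy the human judgments and add an automatic label for every
    unjudged document in the top `depth` results of `run`."""
    filled = dict(human_qrels)  # human labels are never overwritten
    for topic_id, ranked_docs in run.items():
        for doc_id in ranked_docs[:depth]:
            key = (topic_id, doc_id)
            if key not in filled:
                filled[key] = llm_judge(topic_id, doc_id)
    return filled


def precision_at_k(qrels: Qrels, ranked_docs: List[str], topic_id: str,
                   k: int = 10, min_grade: int = 1) -> float:
    """P@k where unjudged documents count as non-relevant."""
    hits = sum(1 for d in ranked_docs[:k]
               if qrels.get((topic_id, d), 0) >= min_grade)
    return hits / k


def stub_llm_judge(topic_id: str, doc_id: str) -> int:
    """Hypothetical stand-in for an LLM relevance call."""
    return 1 if doc_id == "d3" else 0


human_qrels = {("t1", "d1"): 2, ("t1", "d2"): 0}
new_run = {"t1": ["d3", "d1", "d4", "d2"]}  # d3 and d4 are holes

filled_qrels = fill_holes(human_qrels, new_run, stub_llm_judge, depth=4)
p_before = precision_at_k(human_qrels, new_run["t1"], "t1", k=4)
p_after = precision_at_k(filled_qrels, new_run["t1"], "t1", k=4)
```

In this toy case the new run's P@4 moves from 0.25 to 0.5 once the holes are labeled, which is the kind of shift the abstract warns can favor or penalize new runs depending on the judge.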

Paper Structure (13 sections, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Ranking correlation between human- and LLM-generated pools using the $K$ best-performing models from TREC iKAT 2023 runs according to the human-generated pool. The one-shot ChatGPT model with a temperature of 0 is used for generating the LLM-based pool.
  • Figure 2: Absolute distance between the location of a new run before and after filling the holes using ChatGPT and LLaMA. The X-axis shows the average portion of unjudged documents among the top 10 documents returned by the run in the existing human-generated pool. We use the one-shot ChatGPT model with a temperature of 0 and zero-shot LLaMA-3-inst for hole filling.
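
The quantities in these two captions can be sketched in a few lines. The code below is only an illustration under stated assumptions: it uses Kendall's tau-a as the ranking correlation (the captions do not name the coefficient), it reads "location" as a run's position in the system ranking, and all function names and scores are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def unjudged_fraction(judged: Set[Tuple[str, str]],
                      run: Dict[str, List[str]], k: int = 10) -> float:
    """Average, over topics, of the share of top-k documents with no judgment."""
    fractions = []
    for topic_id, ranked_docs in run.items():
        top = ranked_docs[:k]
        if not top:
            continue
        missing = sum(1 for d in top if (topic_id, d) not in judged)
        fractions.append(missing / len(top))
    return sum(fractions) / len(fractions) if fractions else 0.0


def rank_of(system: str, scores: Dict[str, float]) -> int:
    """1-based position of `system` when systems are sorted by descending score."""
    ordered = sorted(scores, key=lambda s: scores[s], reverse=True)
    return ordered.index(system) + 1


def rank_shift(system: str, before: Dict[str, float],
               after: Dict[str, float]) -> int:
    """Absolute change in a system's rank position between two score tables."""
    return abs(rank_of(system, before) - rank_of(system, after))


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a over the systems present in both score tables.
    Tied pairs count as neither concordant nor discordant."""
    systems = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(systems, 2))
    if not pairs:
        raise ValueError("need at least two shared systems")
    total = 0
    for s, t in pairs:
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            total += 1
        elif product < 0:
            total -= 1
    return total / len(pairs)


# Made-up data for illustration only.
judged = {("t1", "d1"), ("t1", "d2")}
new_run = {"t1": ["d3", "d1", "d4", "d2"]}
hole_rate = unjudged_fraction(judged, new_run, k=4)

scores_before = {"A": 0.40, "B": 0.35, "new": 0.20}  # holes count as non-relevant
scores_after = {"A": 0.38, "B": 0.36, "new": 0.45}   # holes filled automatically
shift = rank_shift("new", scores_before, scores_after)
tau = kendall_tau(scores_before, scores_after)
```

Here half of the new run's top four documents are unjudged, and filling them moves the run two positions in the ranking, so the two rankings correlate poorly.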