Improving the Reusability of Conversational Search Test Collections

Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, Mohammad Aliannejadi

TL;DR

This work tackles the reusability problem in conversational search (CS) test collections caused by large pockets of unjudged documents. It proposes using both the commercial GPT-3.5 model and open-source LLMs, notably Llama-3.1 with few-shot prompts and 4-bit QLoRA fine-tuning, to generate relevance judgments and fill holes in the TREC iKAT 23 and TREC CAsT 22 collections. The findings show that CS collections become less reusable at deeper turns, while few-shot training of Llama-3.1 achieves high agreement with human judgments and improves fairness in downstream rankings. Zero-shot GPT-3.5 can yield strong rank correlations, but its absolute judgments differ from those of humans. Overall, regenerating or augmenting the pool with LLM-based judgments, preferably using few-shot Llama-3.1 for the missing judgments, can produce more reusable CS test collections at lower cost, enabling equitable evaluation of future systems.

Abstract

Incomplete relevance judgments limit the reusability of test collections. When new systems are compared to previous systems that contributed to the pool, they often face a disadvantage. This is due to pockets of unjudged documents (called holes) in the test collection that the new systems return. The very nature of Conversational Search (CS) means that these holes are potentially larger and more problematic when evaluating systems. In this paper, we aim to extend CS test collections by employing Large Language Models (LLMs) to fill holes by leveraging existing judgments. We explore this problem using the TREC iKAT 23 and TREC CAsT 22 collections, where information needs are highly dynamic and the responses are much more varied, leaving bigger holes to fill. Our experiments reveal that CS collections show a trend towards less reusability in deeper turns. Also, fine-tuning the Llama 3.1 model leads to high agreement with human assessors, while few-shot prompting of ChatGPT results in low agreement with humans. Consequently, filling the holes of a new system using ChatGPT leads to a larger change in the ranking position of the new system. In contrast, regenerating the whole assessment pool by few-shot prompting ChatGPT and using it for evaluation achieves a high rank correlation with human-assessed pools. We show that filling the holes with a few-shot-trained Llama 3.1 model enables a fairer comparison between the new system and the systems that contributed to the pool. Our hole-filling model based on few-shot training of Llama 3.1 can improve the reusability of test collections.
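
To make the notion of a hole concrete, the following is a minimal sketch of how the unjudged documents returned by a new system could be identified. The function name `find_holes`, the `depth` cutoff, and the toy `qrels` and `new_run` data are all illustrative assumptions and are not taken from the paper.

```python
def find_holes(run, qrels, depth=10):
    """Return, per topic, the top-`depth` retrieved doc ids with no judgment.

    run:   {topic_id: [doc_id, ...]} ranked list per topic (hypothetical layout)
    qrels: {topic_id: {doc_id: relevance_label}} existing human judgments
    """
    holes = {}
    for topic, ranked_docs in run.items():
        judged = qrels.get(topic, {})
        # A document is a hole if it was retrieved but never judged.
        holes[topic] = [d for d in ranked_docs[:depth] if d not in judged]
    return holes


# Toy data, made up purely for illustration.
qrels = {"t1": {"d1": 2, "d2": 0, "d3": 1}}
new_run = {"t1": ["d1", "d7", "d3", "d9"]}

holes = find_holes(new_run, qrels, depth=4)
print(holes)
```

In this toy case the new system retrieves two documents, `d7` and `d9`, that the pool never judged. A standard evaluation would typically treat them as non-relevant, which is the disadvantage described above; a hole-filling model would instead assign them predicted labels.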

Paper Structure

This paper contains 11 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Distribution of the average and standard deviation of the count of unique documents ($\phi$) and relevant unique documents ($\phi^{+}$) retrieved by systems per depth of the conversation. These plots are based on the leave-one-team-out scenario.
  • Figure 2: Rank correlation between the $K$ best-performing systems from TREC iKAT 23 using the $P$ and $P_{llm}$ pools (based on the leave-one-model-out scenario).
  • Figure 3: Absolute distance between the location of a new system before and after filling the holes using GPT-3.5 and Llama-3.1.
  • Figure 4: Average rank correlation ($\tau$) of systems and rank change ($D$) of a new system using $P_{\mathrm{filled}}$ or $P_{\mathrm{hole}}$, compared with the original pool $P$, over conversational turns of different depth. $P_{\mathrm{filled}}$ is formed using judgments obtained by few-shot prompting the Llama-3.1 model in the leave-one-team-out scenario.
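
The two quantities reported in the figures, the rank correlation $\tau$ between system orderings and the rank change $D$ of a new system, can be sketched as follows. This is a hedged illustration only: it assumes $\tau$ is Kendall's tau (computed here as the tau-a variant) and that $D$ is the absolute difference in rank position of the new system under two pools. The function names and all scores are hypothetical and do not come from the paper.

```python
from itertools import combinations


def rank_systems(scores):
    """Order system names from best to worst by score (ties broken by name)."""
    return sorted(scores, key=lambda s: (-scores[s], s))


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts defined over the same systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant if both pools order the two systems the same way.
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def rank_change(system, scores_a, scores_b):
    """Absolute difference in rank position of `system` under two pools."""
    pos_a = rank_systems(scores_a).index(system)
    pos_b = rank_systems(scores_b).index(system)
    return abs(pos_a - pos_b)


# Made-up effectiveness scores: the full pool P versus a pool with holes,
# where the unjudged documents of the system "new" count as non-relevant.
scores_full = {"A": 0.50, "B": 0.40, "C": 0.30, "new": 0.45}
scores_hole = {"A": 0.50, "B": 0.40, "C": 0.30, "new": 0.25}

tau = kendall_tau(scores_full, scores_hole)
D = rank_change("new", scores_full, scores_hole)
print(round(tau, 3), D)
```

In this toy case the new system drops from second to last place when its holes are treated as non-relevant, so $D = 2$, and two of the six system pairs change order. A hole-filling model that behaves well would be expected to push $\tau$ towards 1 and $D$ towards 0 relative to the original pool.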