Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models

Jaekeol Choi

TL;DR

The paper investigates how prompt terms affect relevance evaluation with LLMs in information retrieval, comparing manually crafted prompts from prior work (M) with prompts automatically generated by templates (G) across zero-shot and few-shot settings using GPT-3.5-turbo and GPT-4 on MS MARCO TREC-DL data. It adopts Cohen's kappa to measure alignment between LLM judgments and human judgments and analyzes term-level effects via confusion matrices and precision/recall, revealing that 'answer'-oriented prompts outperform 'relevance'-focused ones and that few-shot context helps clarify relevance definitions. The main contributions are (i) a systematic term-level prompt analysis identifying 'answer' as a key positive term, (ii) demonstration of model-dependent differences with GPT-4 generally outperforming GPT-3.5-turbo, and (iii) evidence that few-shot prompts reduce performance gaps across prompts, particularly for advanced LLMs. These findings offer practical guidelines for prompt design to improve LLM-based relevance evaluation and suggest directions for integrating LLMs into ranking with improved efficiency.
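The agreement and term-level metrics named above can be illustrated with a small worked example. The following is a minimal sketch, assuming binary labels (1 = relevant, 0 = not relevant) and treating the human judgments as the reference; the function names and the toy label lists are hypothetical and are not taken from the paper.

```python
import math

def confusion_counts(human, model):
    """Return (tp, fp, fn, tn) with human labels as reference; 1 = relevant."""
    if len(human) != len(model):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h == 1 and m == 1)
    fp = sum(1 for h, m in pairs if h == 0 and m == 1)
    fn = sum(1 for h, m in pairs if h == 1 and m == 0)
    tn = sum(1 for h, m in pairs if h == 0 and m == 0)
    return tp, fp, fn, tn

def cohens_kappa(human, model):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two binary raters."""
    tp, fp, fn, tn = confusion_counts(human, model)
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                          # observed agreement
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)    # chance agreement on "relevant"
    p_no = ((fp + tn) / n) * ((fn + tn) / n)     # chance agreement on "not relevant"
    p_e = p_yes + p_no
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1 - p_e)

def precision_recall(human, model):
    """Precision and recall of the model's 'relevant' label."""
    tp, fp, fn, _ = confusion_counts(human, model)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data (hypothetical): 8 query-passage pairs.
human = [1, 1, 1, 0, 0, 0, 0, 1]
model = [1, 1, 0, 0, 0, 1, 0, 1]
print(cohens_kappa(human, model))      # 0.5
print(precision_recall(human, model))  # (0.75, 0.75)
```

In this toy case the raters agree on 6 of 8 pairs (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5. A prompt that labels too many passages as relevant would show up as lower precision at similar recall.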

Abstract

Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have applied Large Language Models (LLMs) such as GPT-4 to relevance judgment tasks, demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to identify which specific terms in prompts positively or negatively impact relevance evaluation with LLMs. We employed two types of prompts: those used in previous research and those generated automatically by LLMs. By comparing the performance of these prompts in both few-shot and zero-shot settings, we analyze the influence of specific terms in the prompts. Our study yields two main findings. First, we discovered that prompts using the term 'answer' lead to more effective relevance evaluations than those using 'relevant'. This indicates that a more direct approach, focusing on answering the query, tends to enhance performance. Second, we noted the importance of appropriately balancing the scope of relevance. The term 'relevant' can extend the scope too broadly, resulting in less precise evaluations, so an optimal balance in defining relevance is crucial for accurate assessments. The inclusion of few-shot examples helps define this balance more precisely. By providing clearer contexts for the term 'relevance', few-shot examples contribute to refining relevance criteria. In conclusion, our study highlights the significance of carefully selecting terms in prompts for relevance evaluation with LLMs.

Paper Structure (24 sections, 1 equation, 2 figures, 7 tables)

Figures (2)

  • Figure 1: A prompt example for relevance evaluation. This example utilizes 2-shot examples.
  • Figure 2: Average Cohen's kappa values for top-5 and bottom-5 prompts in GPT-3.5-turbo and GPT-4 across few-shot and zero-shot settings.
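As a rough illustration of the 2-shot prompt layout that Figure 1 describes, the sketch below assembles an instruction, two labeled examples, and the target query-passage pair into a single prompt string. The instruction wording, field names (`Query`, `Passage`, `Output`), and example data are all hypothetical; the paper's actual prompt text is not reproduced here. The instruction is phrased around 'answer' only because that is the term the paper reports as more effective.

```python
def build_prompt(instruction, shots, query, passage):
    """Assemble a few-shot relevance prompt as plain text.

    shots is a list of (query, passage, label) tuples; an empty list
    produces a zero-shot prompt.
    """
    parts = [instruction.strip(), ""]
    for ex_query, ex_passage, ex_label in shots:
        parts += [
            f"Query: {ex_query}",
            f"Passage: {ex_passage}",
            f"Output: {ex_label}",
            "",
        ]
    # The target pair ends with an open "Output:" for the model to complete.
    parts += [f"Query: {query}", f"Passage: {passage}", "Output:"]
    return "\n".join(parts)

# Hypothetical instruction and examples, for illustration only.
instruction = "Does the passage answer the query? Respond with Yes or No."
shots = [
    ("what is the boiling point of water",
     "Water boils at 100 degrees Celsius at sea level.", "Yes"),
    ("what is the boiling point of water",
     "Tea is a popular drink in many countries.", "No"),
]
prompt = build_prompt(
    instruction, shots,
    "who wrote hamlet",
    "Hamlet is a tragedy written by William Shakespeare.",
)
print(prompt)
```

Passing an empty `shots` list yields the zero-shot variant with the same instruction, which makes it straightforward to compare the two settings while holding the prompt term fixed.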