Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study

Mingyang Song, Mao Zheng, Xuan Luo, Yue Pan

TL;DR

The paper examines whether many-shot in-context learning can improve large language models as evaluators, introducing two prompt templates, MSwR and MSoR, designed to reduce evaluator biases. Using LLaMA3-70B responses to GSM8K questions as the material to be judged and GPT-4o as the evaluator, the study shows that increasing the number of in-context demonstrations improves both evaluation consistency and quality, with MSwR outperforming MSoR. It also uncovers symbol and positional biases in evaluators and proposes mitigating them by combining evaluation results. The work provides a basis for more reliable LLM-based evaluation and suggests further study of bias mitigation and longer-context prompting.

Abstract

Utilizing Large Language Models (LLMs) as evaluators to assess the performance of LLMs has garnered attention. However, this kind of evaluation approach is affected by potential biases within LLMs, raising concerns about the accuracy and reliability of the evaluation results of LLMs. To address this problem, we propose and study two many-shot In-Context Learning (ICL) prompt templates to help LLM evaluators mitigate potential biases: Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR). Specifically, the former utilizes in-context examples with model-generated evaluation rationales as references, while the latter does not include these references. Using these prompt designs, we investigate the impact of increasing the number of in-context examples on the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot and few-shot regimes. Furthermore, when using GPT-4o as an evaluator in the many-shot regime, adopting MSwR as the prompt template performs better than MSoR.
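
As a rough illustration of the difference between the two templates, the sketch below assembles a many-shot prompt with or without the rationale in each demonstration. It is not the paper's actual template: the `Demo` fields, the `build_prompt` helper, the instruction line, and the integer rating are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """One in-context example. Field names are illustrative assumptions."""
    question: str
    answer: str
    rationale: str  # model-generated evaluation rationale (the "reference")
    rating: int     # rating scale is an assumption; the source does not state it


def format_demo(demo: Demo, with_reference: bool) -> str:
    """Render one demonstration, including the rationale only when requested."""
    parts = [f"Question: {demo.question}", f"Answer: {demo.answer}"]
    if with_reference:
        parts.append(f"Rationale: {demo.rationale}")
    parts.append(f"Rating: {demo.rating}")
    return "\n".join(parts)


def build_prompt(demos: List[Demo], question: str, answer: str,
                 with_reference: bool) -> str:
    """Assemble a many-shot prompt.

    with_reference=True mimics MSwR (demonstrations carry rationales);
    with_reference=False mimics MSoR (demonstrations omit them).
    """
    header = "Rate the answer to the question."  # placeholder instruction
    blocks = [format_demo(d, with_reference) for d in demos]
    query = f"Question: {question}\nAnswer: {answer}"
    return "\n\n".join([header, *blocks, query])


demos = [Demo("What is 2 + 3?", "5", "The sum is computed correctly.", 1)]
mswr_prompt = build_prompt(demos, "What is 4 * 6?", "24", with_reference=True)
msor_prompt = build_prompt(demos, "What is 4 * 6?", "24", with_reference=False)
```

Scaling to the many-shot regime then amounts to passing a longer `demos` list; the two prompts differ only in whether each demonstration carries its rationale.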

Paper Structure

This paper contains 15 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Consistency between different versions of evaluation results with GPT-4o as a zero-shot evaluator. $v^{\prime}_1$ and $v^{\prime}_2$ are the results based on Prompt (A) in Table \ref{zero_shot}. $v_1$, $v_2$, and $v_3$ are the results based on Prompt (B) in Table \ref{zero_shot}. Prompts (A) and (B) differ in whether the rating is output first or later. The consistency evaluations show that Prompts (A) and (B) obtain almost the same agreement results, but the latter is more convenient for constructing many-shot in-context examples, so we adopt the rationales generated with the latter in this study. $v_1$ vs. $v_2$ denotes the comparison between the first and second versions of evaluations. $v_1$ vs. $v_2$ vs. $v_3$ denotes the comparison among the three versions of evaluations.
  • Figure 2: Evaluation results of LLaMA3-70B on the GSM8K dataset using Prompt (A).
  • Figure 3: Results of random selection.
  • Figure 4: Consistency between two versions of evaluation results. Concretely, the bar at "0" on the x-axis shows the number of samples with consistent and inconsistent ratings when comparing two sets of evaluation results obtained using GPT-4o as the evaluator in the zero-shot regime. In addition, the zero-shot generated rationales are used for MSwR and MSoR. The bar at "$2^n$" on the x-axis shows the consistency of using GPT-4o as an evaluator in MSwR.
  • Figure 5: Comparison of the consistency of the results from the two evaluations.
  • ...and 3 more figures
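
The consistency counts described in the figure captions above (samples with consistent versus inconsistent ratings across repeated evaluations) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `count_consistency` is hypothetical, and treating a sample as consistent only when every version gives the same rating is an assumption about how the three-way comparison is scored.

```python
from typing import Sequence, Tuple


def count_consistency(*versions: Sequence[int]) -> Tuple[int, int]:
    """Count samples whose ratings agree across all evaluation versions.

    Each argument is one version of the evaluation: a sequence of ratings,
    one per sample, in the same sample order. Returns a pair
    (consistent, inconsistent).
    """
    if len(versions) < 2:
        raise ValueError("need at least two evaluation versions to compare")
    n = len(versions[0])
    if any(len(v) != n for v in versions):
        raise ValueError("all versions must rate the same number of samples")
    # A sample is consistent when every version assigned it the same rating.
    consistent = sum(1 for ratings in zip(*versions) if len(set(ratings)) == 1)
    return consistent, n - consistent


# Made-up ratings for four samples, purely for illustration.
v1 = [1, 0, 1, 1]
v2 = [1, 1, 1, 0]
v3 = [1, 0, 1, 0]

pairwise = count_consistency(v1, v2)       # v1 vs. v2
three_way = count_consistency(v1, v2, v3)  # v1 vs. v2 vs. v3
```

Running the same count for each number of in-context examples would give one pair of values per x-axis position.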