Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study
Mingyang Song, Mao Zheng, Xuan Luo, Yue Pan
TL;DR
The paper examines whether many-shot in-context learning can improve large language models as evaluators by introducing two prompt templates, MSwR and MSoR, designed to reduce evaluator biases. Using GSM8K data and LLaMA3-70B to generate questions and rationales, with GPT-4o as the evaluator, the study shows that increasing the number of in-context demonstrations enhances both evaluation consistency and quality, with MSwR outperforming MSoR. It also uncovers symbol and positional biases in evaluators and proposes mitigation by combining evaluation results. The work provides a basis for more reliable LLM-based evaluation and suggests further exploration of bias mitigation and longer-context prompting in future research.
Abstract
Using Large Language Models (LLMs) as evaluators to assess the performance of other LLMs has attracted growing attention. However, this evaluation approach is affected by potential biases within the evaluator LLMs, which raises concerns about the accuracy and reliability of their evaluation results. To address this problem, we propose and study two many-shot In-Context Learning (ICL) prompt templates that help LLM evaluators mitigate potential biases: Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR). The former uses in-context examples that include model-generated evaluation rationales as references, while the latter omits these rationales. Using these prompt designs, we investigate how increasing the number of in-context examples affects the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot and few-shot regimes. Furthermore, when GPT-4o is used as the evaluator in the many-shot regime, the MSwR prompt template performs better than MSoR.
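To make the distinction between the two templates concrete, the following is a minimal sketch of how a many-shot evaluation prompt might be assembled with and without rationales. It is an illustration under stated assumptions, not the templates used in the paper: the pairwise A/B format, the field labels, the instruction wording, and the names `Demo`, `format_demo`, and `build_prompt` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    question: str
    answer_a: str
    answer_b: str
    verdict: str                      # e.g. "A", "B", or "Tie"
    rationale: Optional[str] = None   # model-generated evaluation rationale


def format_demo(demo: Demo, with_reference: bool) -> str:
    """Render one demonstration; include the rationale only in MSwR mode."""
    lines = [
        f"Question: {demo.question}",
        f"Answer A: {demo.answer_a}",
        f"Answer B: {demo.answer_b}",
    ]
    if with_reference and demo.rationale:
        lines.append(f"Rationale: {demo.rationale}")
    lines.append(f"Verdict: {demo.verdict}")
    return "\n".join(lines)


def build_prompt(demos: List[Demo], question: str, answer_a: str,
                 answer_b: str, with_reference: bool) -> str:
    """Assemble instruction, demonstrations, and the query to be judged.

    with_reference=True  -> MSwR-style prompt (demonstrations carry rationales)
    with_reference=False -> MSoR-style prompt (demonstrations carry verdicts only)
    """
    # Assumed instruction text; the paper's wording may differ.
    header = ("Compare the two answers to each question and state "
              "which one is better (A, B, or Tie).")
    blocks = [format_demo(d, with_reference) for d in demos]
    query = "\n".join([
        f"Question: {question}",
        f"Answer A: {answer_a}",
        f"Answer B: {answer_b}",
        "Verdict:",
    ])
    return "\n\n".join([header, *blocks, query])
```

Scaling from the few-shot to the many-shot regime in this sketch only means passing a longer `demos` list; the `with_reference` flag is the sole difference between the two prompt styles.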
