Large Language Models are Inconsistent and Biased Evaluators
Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara
TL;DR
The paper scrutinizes the reliability of LLM-based evaluators for text summarization and reports familiarity bias, skewed score distributions, anchoring effects, and low self-consistency. It runs large-scale analyses on SummEval and RoSE with GPT-3.5 and GPT-4, examining the effects of prompt wording, scoring granularity, temperature, and chain-of-thought (CoT) across more than 560k outputs. It proposes a set of configuration recipes to mitigate these issues and validates them on RoSE, where they improve over previous state-of-the-art LLM evaluators. The work cautions against using LLM evaluators without careful prompt design and bias mitigation, and it offers practical guidelines for building more robust, reference-free evaluation tools.
Abstract
The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work has mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias (a preference for text with lower perplexity), (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also find that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over state-of-the-art LLM evaluators.
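Familiarity bias is defined above in terms of perplexity, so it may help to recall how that quantity is computed. The following is a minimal sketch, not the authors' code: it assumes per-token natural-log probabilities are already available from some language model, and the function name `perplexity` and the example values are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities for two summaries (illustrative values only).
familiar_summary = [-0.2, -0.1, -0.3, -0.2]    # tokens the model finds likely
unfamiliar_summary = [-2.0, -1.5, -2.5, -2.0]  # tokens the model finds unlikely

ppl_familiar = perplexity(familiar_summary)
ppl_unfamiliar = perplexity(unfamiliar_summary)
print(ppl_familiar < ppl_unfamiliar)
```

Under this definition, a lower value means the text is more predictable to the model. An evaluator with familiarity bias would tend to score the first summary higher for that reason, independent of its actual quality.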
