Large Language Models are Inconsistent and Biased Evaluators

Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara

TL;DR

The paper scrutinizes the reliability of LLM-based evaluators for text summarization, revealing familiarity bias, score biases, anchoring, and self-consistency issues. It conducts large-scale analyses using SummEval and RoSE with GPT-3.5 and GPT-4, examining prompts, granularity, and temperature/CoT effects across over 560k outputs. A recipe-driven mitigation strategy is proposed and validated on RoSE, achieving state-of-the-art improvements over previous LLM evaluators. The work cautions against blind use of LLM evaluators without robust prompt design and bias mitigation, and it provides practical guidelines to build more robust, reference-free evaluation tools.

Abstract

The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias, a preference for text with lower perplexity, (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over the state-of-the-art LLM evaluators.

Paper Structure

This paper contains 25 sections, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: System text input for prompting chat-based LLMs to generate automatic evaluation scores in text summarization. This prompting strategy is generalized to allow for use of evaluating any metric(s) of interest, whether multiple or just one.
  • Figure 2: Average perplexity for each rating by GPT-4 and Experts. Summaries are grouped by evaluation scores (as assigned either by Experts or by GPT-4). GPT-4 exhibits a disproportionate bias toward low perplexity summaries compared to expert annotators, demonstrating a familiarity bias.
  • Figure 3: Frequencies of each possible score as found in 64,000 predictions using the 1-100 scale. Models sparsely predict scores within the range. Frequencies of some scores, such as 90 and 95, are far higher than 'odd' scores such as 92 or 19, and much of the range is almost entirely ignored (1-60). Interestingly, 1-60 is a range often largely ignored in academic grading scales. This indicates an issue within instruction-following specific to automatic evaluation.
  • Figure 4: (Top) Score distribution for consistency, conditioned on the previously assigned score for coherence when predicting both within the same context. (Bottom) Human-determined scores for consistency conditioned on what range the score fell into for coherence. Human scores are correlated by Pearson's $r = 0.315$, while GPT-4 scores are correlated by $r = 0.979$. The above figures clearly show how previous scores bias the distribution of future scores in the generation. While such biasing is natural (and in part valid), the effect here is so large it harms performance.
  • Figure 5: Scatter plots of evaluated scores versus expert judgements reveal that, although many papers treat a Kendall's $\tau$ of 0.40 as strong performance, correlation with human judgements still needs substantial improvement. Even with a Kendall's $\tau$ above 0.40, any individual evaluation may fall within a very wide range relative to the ground truth labeled by experts. Note that the full 1-10 range is again underutilized.
  • ...and 6 more figures
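
The captions above rely on three simple statistics: how often each score is predicted (Figure 3), Pearson's $r$ between two sets of scores (Figure 4), and Kendall's $\tau$ between evaluator and expert scores (Figure 5). The following is a minimal pure-Python sketch of those statistics, not the authors' code. The function names (`pearson_r`, `kendall_tau_b`, `unused_scores`) and the toy scores are hypothetical, and the choice of the tie-corrected tau-b variant is an assumption, since the captions do not say which variant of Kendall's $\tau$ is reported.

```python
from collections import Counter
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(xs, ys):
    """Kendall's tau-b (tie-corrected), using a simple O(n^2) pair loop."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: excluded from both terms
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom


def unused_scores(scores, low, high):
    """Scores in [low, high] that the evaluator never predicted."""
    counts = Counter(scores)
    return [s for s in range(low, high + 1) if counts[s] == 0]


if __name__ == "__main__":
    # Toy data invented for illustration only; not taken from the paper.
    expert = [2, 3, 3, 4, 5, 1, 4, 2]   # hypothetical expert ratings (1-5)
    llm = [7, 8, 8, 9, 9, 7, 8, 8]      # hypothetical LLM ratings (1-10)
    print("Pearson r:", round(pearson_r(expert, llm), 3))
    print("Kendall tau-b:", round(kendall_tau_b(expert, llm), 3))
    print("Unused scores on 1-10:", unused_scores(llm, 1, 10))
```

The pairwise loop is quadratic in the number of summaries, which is adequate for a sketch; for larger score sets a library routine such as `scipy.stats.kendalltau` would be the more practical choice.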