Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy, Nesreen K. Ahmed, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi

TL;DR

Problem: LLM-based judges are increasingly used to evaluate summaries but display biases that threaten reliability. Approach: a large-scale, controlled study tests 9 LLMs (1B–12B) on WikiSum and CNN_DailyMail, with careful length and order controls and enhanced similarity scoring to map judgments to overlap with human texts. Findings: LLM judges prefer generated summaries when overlap with human-written summaries is low, a pattern that persists across model sizes and architectures, and interacts with presentation order; this raises concerns about using LLMs as sole evaluators. Implications: the results suggest that simple overlap metrics are insufficient for LLM-based evaluation and point to the need for alternative judging strategies, bias mitigation, and potential use of LLM-stylistic signals for detection rather than evaluation.

Abstract

Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to paraphrasing. However, LLM judges show biases for length and order among others, and are vulnerable to various adversarial input prompts. While recent studies have looked into these biases, few have analyzed them at a more granular level in relation to a well-defined overlap metric. In this work we provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarities (as measured by ROUGE and BLEU) between the judged summaries decrease, and this pattern extends to all but one model tested, and exists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlaps, suggesting that LLM-as-a-judge in the summary domain should rely on techniques beyond a simple comparison.

Paper Structure (9 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: Prompt used for the LLMs to generate initial summaries to be evaluated.
  • Figure 2: Prompt used for the LLMs to generate initial summaries to be evaluated.
  • Figure 3: Visual representation of evaluator choice labels. "GT" means the evaluator chooses the ground truth summary in both orders; "Generated" means the evaluator chooses the LLM-generated summary in both orders. "Tied-chose-first" means the evaluator chooses the first presented summary in both orders, and "Tied-chose-last" means the evaluator chooses the last presented summary in both orders.
  • Figure 4: Proportion of documents where the evaluator chooses the ground truth (GT) summary, chooses the generated summary, or chooses the first or the last presented summary regardless of order, plotted against the score of the non-ground-truth summaries. See Figure 3 for a visual representation of evaluator choices. The score is the mean of ROUGE-1, ROUGE-2, BLEU-1, and BLEU-4. Here the generators (summarizers) are Gemma 3, Phi 4 mini, Mistral, Llama 3, and GPT-4o mini (i.e., no Qwen 3). For Llama 3 8B and Mistral, additional summaries are generated with different prompts to ascertain possible patterns in higher-scored summaries. Further breakdowns for the Llama and Gemma variants can be found in the model-variants figure (fig:model-variants).
  • Figure 5: Alternative version of Figure 4, where each row of the grid shows the results for a single summarizer instead.
  • ...and 2 more figures
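
The evaluator labels described in the Figure 3 caption and the averaged score described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the `GT`/`GENERATED` identifiers, and the convention that the ground truth summary is shown first in the first presentation order are all assumptions made for this example. The four metric values are taken as precomputed inputs, since the summary above does not specify how ROUGE and BLEU were computed.

```python
from collections import Counter
from statistics import mean

# Hypothetical identifiers for the two summary sources (not from the paper).
GT, GENERATED = "gt", "generated"


def label_choice(pick_gt_first: str, pick_gen_first: str) -> str:
    """Map a judge's picks over both presentation orders to one label.

    pick_gt_first:  source picked when the ground truth summary is shown first.
    pick_gen_first: source picked when the generated summary is shown first.
    """
    for pick in (pick_gt_first, pick_gen_first):
        if pick not in (GT, GENERATED):
            raise ValueError(f"unknown pick: {pick!r}")
    if pick_gt_first == pick_gen_first:
        # Same source chosen in both orders: a consistent preference.
        return "GT" if pick_gt_first == GT else "Generated"
    # Picks differ, so the judge followed position rather than content.
    return "Tied-chose-first" if pick_gt_first == GT else "Tied-chose-last"


def overlap_score(rouge1: float, rouge2: float, bleu1: float, bleu4: float) -> float:
    """Mean of ROUGE-1, ROUGE-2, BLEU-1, and BLEU-4, as in the Figure 4 caption."""
    return mean([rouge1, rouge2, bleu1, bleu4])


def label_proportions(labels: list[str]) -> dict[str, float]:
    """Share of documents assigned to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Made-up picks for four documents, purely for illustration.
    picks = [(GT, GT), (GENERATED, GENERATED), (GT, GENERATED), (GENERATED, GT)]
    labels = [label_choice(a, b) for a, b in picks]
    print(labels)
    print(label_proportions(labels))
```

Grouping documents by `overlap_score` and calling `label_proportions` within each group would give the kind of per-bin proportions that the Figure 4 caption describes, although the binning actually used in the paper is not stated here.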