Evaluating Scoring Bias in LLM-as-a-Judge
Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu
TL;DR
This paper investigates scoring bias in LLM-as-a-Judge, focusing on perturbations to the scoring prompt rather than evaluation targets. It defines three novel biases—rubric order bias, score ID bias, and reference answer score bias—and introduces an automatic data-synthesis pipeline and a formal evaluation framework with stability, accuracy, and scoring-tendency metrics. Across four benchmarks and multiple judge models, the study shows that scoring bias persists even in strong models like GPT-4o, with model scale influencing robustness and full-mark reference answers generally improving accuracy. The findings yield practical guidance for designing robust scoring prompts and mitigating biases, advancing reliable, scalable automated evaluation in real-world applications.
Abstract
The "LLM-as-a-Judge" paradigm, which uses Large Language Models (LLMs) as automated evaluators, is pivotal to LLM development because it offers scalable feedback for complex tasks. However, the reliability of these judges is compromised by various biases. Existing research has concentrated heavily on biases in comparative evaluations. In contrast, scoring-based evaluations, which assign an absolute score and are often more practical in industrial applications, remain under-investigated. To address this gap, we undertake the first dedicated examination of scoring bias in LLM judges. We shift the focus from biases tied to the evaluation targets to those originating from the scoring prompt itself. We formally define scoring bias and identify three previously unstudied types: rubric order bias, score ID bias, and reference answer score bias. We propose a comprehensive framework to quantify these biases, featuring a suite of multi-faceted metrics and an automatic data synthesis pipeline that creates a tailored evaluation corpus. Our experiments demonstrate that even the most advanced LLMs suffer from substantial scoring biases. Our analysis yields actionable insights for designing more robust scoring prompts and mitigating these newly identified biases.
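To make the three prompt-side perturbations concrete, the following is a minimal sketch, not the paper's pipeline. The rubric wording, the ID styles, the helper names (render_rubric, build_prompt, agreement_rate), and the agreement measure are all assumptions introduced for illustration; the paper's own metrics for stability, accuracy, and scoring tendency are not reproduced here.

```python
import random

# Hypothetical 1-5 rubric; the wording is illustrative, not taken from the paper.
RUBRIC = {
    1: "The response is irrelevant or incorrect.",
    2: "The response is mostly incorrect with minor relevant content.",
    3: "The response is partially correct but incomplete.",
    4: "The response is mostly correct with minor flaws.",
    5: "The response is fully correct and complete.",
}

# Assumed alternative score identifiers (the "score ID" perturbation).
ID_STYLES = {
    "arabic": {1: "1", 2: "2", 3: "3", 4: "4", 5: "5"},
    "roman": {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V"},
    "letter": {1: "A", 2: "B", 3: "C", 4: "D", 5: "E"},
}


def render_rubric(rubric, order="ascending", id_style="arabic", seed=0):
    """Render the rubric text with a chosen listing order and score-ID style."""
    scores = sorted(rubric)
    if order == "descending":
        scores.reverse()
    elif order == "random":
        random.Random(seed).shuffle(scores)  # seeded for reproducibility
    elif order != "ascending":
        raise ValueError(f"unknown order: {order}")
    labels = ID_STYLES[id_style]
    return "\n".join(f"Score {labels[s]}: {rubric[s]}" for s in scores)


def build_prompt(rubric_text, response, reference=None, reference_score=None):
    """Assemble a scoring prompt, optionally attaching a scored reference answer."""
    parts = ["Score the response using the rubric below.", rubric_text]
    if reference is not None:
        parts.append(f"Reference answer (score {reference_score}): {reference}")
    parts.append(f"Response to score: {response}")
    return "\n\n".join(parts)


def agreement_rate(base_scores, perturbed_scores):
    """Fraction of items whose score is unchanged under the perturbed prompt.

    A simple illustrative stability proxy, not the paper's metric.
    """
    if not base_scores or len(base_scores) != len(perturbed_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(b == p for b, p in zip(base_scores, perturbed_scores))
    return matches / len(base_scores)
```

In use, one would render the same rubric under two settings (for example, ascending versus descending order), have the judge model score the same responses under each prompt, and compare the two score lists with agreement_rate. The judge call itself is deliberately left out, since it depends on the model and API in use.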
