Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks
Rajarshi Haldar, Julia Hockenmaier
TL;DR
This study investigates the intra-rater reliability of LLMs used as judges for NLG evaluation, revealing substantial variability in a model's own ratings across runs and benchmarks. Using Krippendorff's Alpha, it shows that even larger models like Qwen-3 can approach but not consistently reach acceptable reliability, and that taking a consensus over multiple runs can improve, but not fully fix, alignment with human judgments. The results span SummaC, SummEval, and MT-Bench, indicating that LLM judges cannot yet reliably substitute for human evaluators, particularly on subjective metrics. The paper provides practical guidelines: report self-reliability, explore variance reduction strategies that preserve accuracy, and collect human self-reliability data to calibrate LLM judgments.
Abstract
As Natural Language Generation (NLG) continues to be widely adopted, properly assessing its output has become difficult. Recently, using large language models (LLMs) to evaluate these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability: the scores they assign to the same input vary across runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, and makes it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks, and examine whether LLM judges can still be useful when applied judiciously under proper guidelines.
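To make the notion of intra-rater reliability concrete, the following is a minimal sketch of Krippendorff's Alpha (the measure named in the TL;DR) applied to repeated runs of a single judge. It rests on assumptions not stated in the source: the function name `krippendorff_alpha_interval` is hypothetical, scores are treated as interval data with squared-difference distance, and every item has at least two runs. The paper's actual level of measurement and implementation may differ.

```python
import math
from itertools import permutations


def krippendorff_alpha_interval(ratings):
    """Krippendorff's Alpha with the interval distance (a - b)^2.

    ratings: list of items; each item is a list of numeric scores that
    the same judge assigned to that item on repeated runs.
    Assumption: illustrative sketch only, not the paper's implementation.
    """
    # Only items scored at least twice contribute pairable values.
    units = [u for u in ratings if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more runs")

    # Observed disagreement: within-item squared differences over ordered
    # pairs, each item weighted by 1 / (number of runs - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences over all ordered pairs
    # of values, ignoring which item they came from.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))

    if d_e == 0:
        # All scores identical: Alpha is undefined (0 / 0).
        return math.nan
    return 1.0 - d_o / d_e


# Hypothetical scores: three items, each judged in three separate runs.
consistent = [[1, 1, 1], [3, 3, 3], [5, 5, 5]]
inconsistent = [[1, 3, 5], [2, 4, 1], [5, 2, 3]]
alpha_consistent = krippendorff_alpha_interval(consistent)
alpha_inconsistent = krippendorff_alpha_interval(inconsistent)
```

An Alpha of 1 means the judge reproduces its own scores exactly, while values near or below 0 mean the run-to-run disagreement is as large as would be expected from shuffling scores across items. The pairwise loops are quadratic in the number of scores, which is fine for illustration but would call for a coincidence-matrix formulation on large benchmarks.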
