Are We on the Right Way to Assessing LLM-as-a-Judge?

Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen

TL;DR

This paper introduces Sage, a ground-truth-free framework for evaluating LLMs acting as judges by measuring local consistency (IPI) and global logical coherence (TOV) via a symmetrized, round-robin protocol. Using a 650-question dataset drawn from RewardBench2 and WildChat-1M, Sage demonstrates strong external alignment with established benchmarks and robustness across model types, temperatures, and prompts. The study reveals significant reliability gaps in state-of-the-art LLMs and finds that fine-tuning, multi-agent panels, and explicit rubrics improve judging consistency, with rubrics helping to mitigate situational preference. It also shows that human evaluators exhibit notable inconsistency, underscoring Sage’s value as a scalable, cost-effective, and more objective tool for diagnosing and improving LLM evaluators, with practical implications for automated model ranking and RL-based training.

Abstract

LLM-as-a-Judge has been widely adopted as an evaluation method and as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pairwise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that fine-tuning an LLM-as-a-Judge is a feasible way to boost performance, and that panel-based judges as well as deep reasoning can enhance judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
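To make the idea of local self-consistency concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical judge callable that takes two answer texts and returns "first", "second", or "tie". It presents every pair in both orders and reports the fraction of pairs whose implied winner changes. The function names, the verdict strings, and the toy judges are all illustrative assumptions; the exact metric definition in the paper may differ.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence

# Hypothetical judge interface (an assumption, not from the paper):
# takes the two answers in presentation order, returns "first", "second" or "tie".
Judge = Callable[[str, str], str]


def winner(verdict: str, first: int, second: int) -> Optional[int]:
    """Map a positional verdict back to the index of the preferred answer."""
    if verdict == "first":
        return first
    if verdict == "second":
        return second
    return None  # tie


def pairwise_inconsistency(answers: Sequence[str], judge: Judge) -> float:
    """Fraction of unordered pairs whose preferred answer changes when the
    presentation order is swapped (0.0 = fully stable, 1.0 = always flips)."""
    pairs = list(combinations(range(len(answers)), 2))
    if not pairs:
        return 0.0
    flipped = 0
    for i, j in pairs:
        forward = winner(judge(answers[i], answers[j]), i, j)
        backward = winner(judge(answers[j], answers[i]), j, i)
        if forward != backward:
            flipped += 1
    return flipped / len(pairs)


def position_biased_judge(a: str, b: str) -> str:
    """Toy judge that always prefers whichever answer is shown first."""
    return "first"


def length_judge(a: str, b: str) -> str:
    """Toy judge that prefers the longer answer regardless of position."""
    if len(a) == len(b):
        return "tie"
    return "first" if len(a) > len(b) else "second"


candidates = ["a", "bb", "ccc", "dddd"]
biased_score = pairwise_inconsistency(candidates, position_biased_judge)
stable_score = pairwise_inconsistency(candidates, length_judge)
```

In this toy setup the position-biased judge flips on every pair, while the length-based judge never does, which is the kind of contrast a ground-truth-free consistency check is meant to expose.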

Paper Structure

This paper contains 45 sections, 25 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Human-annotated preference may not be reliable. We find three key challenges with relying on human annotators for evaluating LLM-as-a-Judge systems. (a) Inter-annotator Disagreement: Different annotators can have conflicting preferences, especially for subjective questions, leading to noisy and inconsistent data. (b) Overlooking Nuances: Annotators may miss subtle errors or inaccuracies in lengthy and complex answers, leading to flawed evaluations. (c) Cognitive Biases: Human evaluators are susceptible to cognitive biases, such as favoring an answer that confirms their false beliefs, which can further compromise the objectivity of the assessment.
  • Figure 2: Sage uses a symmetrized, round-robin protocol to conduct pairwise comparisons on a set of candidate answers. From these judgments, Sage quantifies performance using two metrics: IPI, which measures local consistency by tracking preference reversals (e.g., 3 inconsistent pairs result in an IPI of 0.5), and TOV, which assesses global logical coherence by calculating the minimum number of alterations required for a consistent ranking (e.g., 3 alterations required). This methodology scalably diagnoses logical deficiencies to help identify and select more reliable LLM evaluators.
  • Figure 3: We provide statistics and analysis of our selected queries and answers within Sage. Distribution of CV values shows the varied difficulty among our two subsets.
  • Figure 4: Ablation results for ChatEval showing performance degradation across all configuration variants (lower scores are better).
  • Figure 5: We discover high IPI and TOV scores in Sage-Hard due to the situational preference phenomenon in LLM-as-a-Judge, while deep thinking and explicit rubrics can mitigate this.
  • ...and 4 more figures
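
The Figure 2 caption describes TOV as the minimum number of alterations needed to turn a set of pairwise judgments into a consistent ranking. The Python sketch below illustrates that idea under simplifying assumptions that are not taken from the paper: each pair has exactly one recorded winner (no ties), the target is a strict total order, and the search is a brute force over all permutations, which is only practical for a handful of candidates. The function name and input format are hypothetical.

```python
from itertools import combinations, permutations
from typing import Dict, Tuple


def min_alterations(n: int, prefers: Dict[Tuple[int, int], int]) -> int:
    """Smallest number of pairwise verdicts that must be changed so that all
    verdicts agree with a single strict ranking of n candidates.

    prefers maps each pair (i, j) with i < j to the index of the preferred
    candidate (either i or j).
    """
    best = None
    for order in permutations(range(n)):
        # rank[item] is the position of item in this candidate ranking.
        rank = {item: pos for pos, item in enumerate(order)}
        changes = 0
        for i, j in combinations(range(n), 2):
            implied = i if rank[i] < rank[j] else j
            if prefers[(i, j)] != implied:
                changes += 1
        if best is None or changes < best:
            best = changes
    return best


# A preference cycle: 0 beats 1, 1 beats 2, but 2 beats 0.
cycle = {(0, 1): 0, (1, 2): 1, (0, 2): 2}
# A transitive set of preferences: 0 beats 1, 1 beats 2, 0 beats 2.
chain = {(0, 1): 0, (1, 2): 1, (0, 2): 0}

cycle_cost = min_alterations(3, cycle)
chain_cost = min_alterations(3, chain)
```

Flipping any single verdict in the cycle yields a valid ranking, so its cost is 1, whereas the transitive chain needs no changes. Similarly, the caption's IPI example is consistent with 3 reversed pairs out of 6 (for instance, 4 candidates compared round-robin), although the number of candidates in the figure is an assumption here.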