Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

Mingyang Song, Mao Zheng, Chenning Xu

Abstract

The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. \textbf{First}, we demonstrate that this consensus is frequently illusory. We identify and formalize \textbf{Evaluation Illusion}, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs $\times$ 3 frontier judges $\times$ 100 tasks $\times$ 11 temperatures), we show that model-level agreement (Spearman $\rho = 0.99$) masks fragile sample-level agreement (Pearson $\bar{r} = 0.72$; absolute agreement ICC $= 0.67$), that merely sharing rubric structure restores 62\% of total agreement, and that high-quality outputs paradoxically receive the \textit{least} consistent evaluations. \textbf{Second}, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement \textit{increases} in codified domains (Education +22\%, Academic +27\%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.
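
The gap between model-level and sample-level agreement described in the abstract can be illustrated with a minimal sketch on synthetic data. Everything below is an assumption made for illustration, not the authors' pipeline: the score-generating model (latent model quality plus per-sample variation plus independent judge noise), the noise scales, and the helper names `pearson` and `spearman` are all hypothetical. Only the design dimensions (32 LLMs, 3 judges, 100 tasks, 11 temperatures) come from the text.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation via ranks (double argsort).
    Assumes no ties, which holds for the continuous synthetic scores below."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return pearson(rx, ry)

# Design size stated in the abstract.
n_models, n_judges, n_tasks, n_temps = 32, 3, 100, 11
n_instances = n_models * n_judges * n_tasks * n_temps  # 105600

# Synthetic scores (assumed generative model, for illustration only).
rng = np.random.default_rng(0)
model_quality = np.linspace(3.0, 9.0, n_models)            # latent per-model quality
sample_effect = rng.normal(0.0, 1.0, (n_models, n_tasks))  # per-output variation
true_score = model_quality[:, None] + sample_effect

# Two hypothetical judges that see the same outputs but add independent noise.
judge_a = true_score + rng.normal(0.0, 1.5, true_score.shape)
judge_b = true_score + rng.normal(0.0, 1.5, true_score.shape)

# Model-level agreement: rank correlation of per-model mean scores.
model_rho = spearman(judge_a.mean(axis=1), judge_b.mean(axis=1))

# Sample-level agreement: Pearson r within each model, averaged over models.
sample_r = float(np.mean([pearson(judge_a[m], judge_b[m]) for m in range(n_models)]))

print(f"instances: {n_instances}")
print(f"model-level Spearman rho: {model_rho:.2f}")
print(f"mean sample-level Pearson r: {sample_r:.2f}")
```

Averaging over 100 tasks cancels most of the judge-specific noise, so the two judges rank models almost identically even though their scores for individual outputs agree only weakly. This is one simple mechanism that can produce the pattern the abstract reports; the paper's own explanation (shared surface heuristics) is a stronger claim that this sketch does not test.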

Paper Structure (85 sections, 1 equation, 6 figures, 9 tables)


Figures (6)

  • Figure 1: Illustration of Evaluation Illusion. Without knowledge grounding, evaluators form a "Shared Illusion," unanimously rewarding the professional formatting of a fundamentally flawed business pitch. MERG forces knowledge activation, revealing that Claude penalizes the regulatory violation while Gemini continues to reward surface heuristics. GPT (3.7, not shown) penalizes even more harshly. Full analysis with all three evaluators is in Appendix \ref{case:formatting_trap}.
  • Figure 2: Knowledge injection systematically reduces evaluator agreement. Baseline vs. MERG agreement across 11 temperatures. The persistent gap ($\Delta_K < 0$) indicates that baseline agreement is heavily reliant on shared surface heuristics rather than substantive deliberation. The gap is largest at moderate temperatures ($t \approx 0.3$) and narrows at the extremes as both conditions converge toward lower agreement.
  • Figure 3: The Rubric Commensurability Problem. Agreement across MERG ablation variants. Independent evaluation (Original) yields near-random agreement; standardizing rubric structure (5-Dim) drives the largest increase, indicating most reported agreement is structural rather than substantive.
  • Figure 4: Domain-selective knowledge effects. $\Delta_K$ by domain at $t{=}0.0$. Knowledge increases agreement in codified domains (Education, Academic) but decreases it in subjective ones (Literature), ruling out the noise hypothesis.
  • Figure 5: Sample-level agreement is universally fragile. Pearson $r$ across 32 LLMs at 11 temperatures ($t \in [0.0, 1.0]$). Each subplot shows one model; lines represent evaluator pairs (Claude/Gemini, Claude/GPT, Gemini/GPT); background shading indicates model type (Base, Instruct, Thinking). The pattern is consistent: agreement is moderate ($r \approx 0.5$ to $0.8$) and largely temperature-invariant, confirming that the Resolution Paradox (high model-level $\rho$, low sample-level $r$) is a universal property of LLM evaluation, not a model-specific artifact.
  • ...and 1 more figure