When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson

TL;DR

This work tackles the validity challenges of LLM-judged benchmarks by introducing two diagnostics, Schematic Adherence and Psychometric Validity, which quantify how well rubric factors drive overall judgments and measure residual uncertainty. The methods are applied to Arena-Hard Auto, revealing severe rubric incoherence and factor collapse, with cross-factor correlations often exceeding 0.93 and substantial unexplained variance; moreover, ELO-style aggregation creates a false sense of stability by producing $R^2$ values near $0.998$. The authors propose benchmark design principles to tighten objectives, audit factor structure, and report uncertainty, arguing for reliability-aware benchmarks and providing open-source code and data for reproducibility. Overall, the paper highlights that many LLM-judged evaluations can be invalid or misleading if not scrutinized for adherence and discriminant validity, and it offers practical guidance to restore validity in open-ended AI evaluation.

Abstract

LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth-based benchmarks. We argue that without tight objectives and verifiable constructions, benchmarks can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge's overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We released our code and dataset at https://github.com/penfever/judgment-to-noise
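
The idea behind schematic adherence, namely how much of a judge's overall verdict is explained by its own rubric factors, can be illustrated with an ordinary least-squares fit. The sketch below is an assumption-laden illustration, not the authors' implementation: the function name `explained_variance`, the synthetic data, and the choice of a plain linear model with an intercept are all hypothetical, and the paper's exact score (Definition 4) is not reproduced here.

```python
import numpy as np

def explained_variance(factor_scores, overall):
    """R^2 of a linear fit of the overall verdict on per-factor rubric scores.

    Hypothetical helper: a stand-in for a linear schematic model,
    not the paper's exact adherence score.
    """
    X = np.asarray(factor_scores, dtype=float)
    y = np.asarray(overall, dtype=float)
    # Add an intercept column so the fit is not forced through the origin.
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

# Synthetic example (assumed data): 200 judgments, 5 rubric factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 5))

# A judge whose verdict is a weighted sum of its factor scores.
coherent = factors @ np.array([0.3, 0.2, 0.2, 0.2, 0.1])
# A judge whose verdict is unrelated to its factor scores.
noisy = rng.normal(size=200)

r2_coherent = explained_variance(factors, coherent)
r2_noisy = explained_variance(factors, noisy)
print(f"coherent judge R^2: {r2_coherent:.3f}")
print(f"noisy judge R^2:    {r2_noisy:.3f}")
```

Under this sketch, unexplained variance is simply `1 - R^2`: a judge that follows its rubric scores near zero, while a judge that ignores it leaves most of the variance unexplained.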

Paper Structure

This paper contains 28 sections, 18 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: The majority of true judgment variance has no known cause. On the Arena-Hard-Auto benchmark, with a rubric specifying 5 judgment criteria, we find that across four judges and two settings (different cohorts of models to be compared), approximately 55% of variance, on average, is unexplained by either linear or Taylor-series polynomial factor analysis on the rubric criteria. After ELO transformation, the linear model explains 100% of observed variance, indicating that, by enforcing transitivity, ELO hides true latent uncertainty in multi-factor analysis.
  • Figure 2: Psychometric validity summary for Setting 1 across four judges. Bars show (top-left) Cronbach's $\alpha$ (internal consistency); (top-right) cross-loading ratio (CLR; higher indicates stronger factor separation, shown also as a normalized $[0,1]$ score in the validity computation); (bottom-left) HTMT computed on absolute item-level correlations (lower is better; the dashed line marks the 0.85 threshold); and (bottom-right) overall validity score combining $\alpha$ and discriminant components via a harmonic mean and applying a multiplicative penalty $(1-\phi)$ for factor-wise failure rate $\phi$ (share of unscorable judgments such as "Safety: N/A"). Atypical results appear primarily on the safety factor for GPT-3.5, where high failure rates depress the overall score.
  • Figure 3: Most factors are highly correlated for most judges. In Setting 1, across all four judges, the average Spearman rank correlation matrix shows high cross-factor correlations ($>0.93$ for most pairs). This suggests factor collapse: the inability of judges to meaningfully distinguish between semantically distinct rubric factors in this setting.
  • Figure 4: ELO-style aggregation compresses multi-dimensional, noisy judgments into apparently smooth rankings, masking upstream uncertainty. Even more so than in general-case factor analysis, the feature loadings under ELO exhibit clear and strong positive or negative correlation.
  • Figure 5: Diverse judges exhibit very similar latent factor loadings. Across four different LLM judges in benchmark Setting 2, the eigenvalues associated with factor weightings are highly similar; all indicate a collapse of significance in the latent loadings.
  • ...and 5 more figures
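
The overall validity score described in the Figure 2 caption combines component scores through a harmonic mean and then applies a multiplicative penalty $(1-\phi)$ for the failure rate $\phi$. A minimal sketch of that combination follows. Which components enter the mean and how each is normalized to $[0,1]$ are assumptions here, and the function name `validity_score` is hypothetical rather than taken from the authors' code.

```python
def validity_score(components, failure_rate):
    """Harmonic mean of component scores in [0, 1], scaled by (1 - failure_rate).

    Hypothetical helper: the component list and its normalization are
    assumptions, not the paper's exact definition.
    """
    if not components:
        raise ValueError("at least one component score is required")
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must lie in [0, 1]")
    # The harmonic mean is dominated by the weakest component;
    # a zero (or negative) component collapses the score to zero.
    if any(c <= 0.0 for c in components):
        return 0.0
    harmonic_mean = len(components) / sum(1.0 / c for c in components)
    return harmonic_mean * (1.0 - failure_rate)

# Assumed example values: internal consistency 0.9, discriminant score 0.3.
balanced = validity_score([0.9, 0.9], failure_rate=0.0)
lopsided = validity_score([0.9, 0.3], failure_rate=0.0)
penalized = validity_score([0.9, 0.3], failure_rate=0.5)
print(balanced, lopsided, penalized)
```

The harmonic mean is a natural choice when one weak component should not be averaged away: strong internal consistency cannot compensate for poor factor separation, and a high share of unscorable judgments lowers the score proportionally.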

Theorems & Definitions (11)

  • Definition 1: Linear Schematic Model
  • Definition 2: Non-linear Schematic Model
  • Definition 3: Context-Dependent Schematic Patterns
  • Definition 4: Schematic Adherence Score
  • Definition 5: Integration Bias Metrics
  • Definition 6: Cronbach's Alpha
  • Definition 7: Cross-loading Ratio
  • Definition 8: HTMT Ratio
  • Definition 9: Bounded CLR Normalization
  • Definition 10: Failure Rate
  • ...and 1 more
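
Of the definitions listed above, Cronbach's alpha has a standard textbook form: for $k$ items, $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_T^2}\right)$, where $\sigma_i^2$ is the variance of item $i$ and $\sigma_T^2$ is the variance of the summed score. The sketch below implements that standard formula; how the paper groups judge outputs into items is not specified in this excerpt, so the data layout (rows as judgments, columns as items) and the synthetic examples are assumptions.

```python
import numpy as np

def cronbach_alpha(items):
    """Standard Cronbach's alpha for an (n_observations, k_items) array."""
    X = np.asarray(items, dtype=float)
    if X.ndim != 2 or X.shape[1] < 2:
        raise ValueError("need a 2-D array with at least two item columns")
    k = X.shape[1]
    # Sum of per-item sample variances.
    item_variance_sum = X.var(axis=0, ddof=1).sum()
    # Sample variance of the per-observation total score.
    total_variance = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variance_sum / total_variance)

# Synthetic examples (assumed data), 2000 observations and 5 items each.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 1))

# Items that are identical copies of one latent score: maximal consistency.
alpha_identical = cronbach_alpha(np.repeat(latent, 5, axis=1))
# Items that are statistically independent: consistency near zero.
alpha_independent = cronbach_alpha(rng.normal(size=(2000, 5)))
print(f"identical items:   {alpha_identical:.3f}")
print(f"independent items: {alpha_independent:.3f}")
```

High alpha alone does not establish validity: when all rubric factors move together, as the factor-collapse finding suggests, alpha is high precisely because the factors are not being distinguished, which is why it is paired with discriminant measures such as the cross-loading ratio and HTMT.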