Validating LLM-as-a-Judge Systems under Rating Indeterminacy

Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova

TL;DR

The paper addresses rating indeterminacy in the meta-evaluation of LLM-as-a-judge systems, showing that forced-choice elicitation can misrank judge systems. It develops a probabilistic framework linking forced-choice and response-set data, and advocates multi-label metrics such as mean-squared error (MSE) over response sets, with two variants: MSE $F$ (oracle $\mathbf{F}$) and MSE $\hat{F}$ ($\mathbf{F}$ estimated from a small auxiliary corpus). Across 11 real-world rating tasks and 9 LLMs, it demonstrates that standard forced-choice validation can select judge systems that perform as much as 31% worse than those selected by methods that account for rating indeterminacy, and that fully specified tasks or multi-label agreement metrics lead to more robust judge selection and downstream performance. The work provides concrete guidelines for principled meta-evaluation, highlights rank-consistency concerns, and makes code and data publicly available to support reproducibility and adoption in practice.

Abstract

The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, plays a critical role in scaling and standardizing GenAI evaluations. To validate such judge systems, evaluators assess human--judge agreement by first collecting multiple human ratings for each item in a validation corpus, then aggregating the ratings into a single, per-item gold label rating. For many items, however, rating criteria may admit multiple valid interpretations, so a human or LLM rater may deem multiple ratings "reasonable" or "correct." We call this condition rating indeterminacy. Problematically, many rating tasks that contain rating indeterminacy rely on forced-choice elicitation, whereby raters are instructed to select only one rating for each item. In this paper, we introduce a framework for validating LLM-as-a-judge systems under rating indeterminacy. We draw theoretical connections between different measures of judge system performance under different human--judge agreement metrics, and different rating elicitation and aggregation schemes. We demonstrate that differences in how humans and LLMs resolve rating indeterminacy when responding to forced-choice rating instructions can heavily bias LLM-as-a-judge validation. Through extensive experiments involving 11 real-world rating tasks and 9 commercial LLMs, we show that standard validation approaches that rely upon forced-choice ratings select judge systems that are highly suboptimal, performing as much as 31% worse than judge systems selected by our approach that uses multi-label "response set" ratings to account for rating indeterminacy. We conclude with concrete recommendations for more principled approaches to LLM-as-a-judge validation.

Paper Structure

This paper contains 34 sections, 5 theorems, 25 equations, 39 figures, 5 tables.

Key Result

Theorem 3.1

Under our rating model (Figure 4), the response set distribution is identifiable from the forced-choice distribution if and only if the rating task is fully specified.

Figures (39)

  • Figure 1: Examples of rating indeterminacy in (1) toxic language, (2) factuality, (3) helpfulness, and (4) relevance rating tasks. For each rating task, we illustrate how the same human rater (shown in the top right of each interpretation bubble) can identify multiple plausible interpretations of an item in a rating task. The status quo forced-choice elicitation approach requires each rater to select only a single "correct" option. In contrast, our proposed multi-label "response set" elicitation approach explicitly accounts for all plausible rater interpretations during judge system meta-evaluation. See Figure \ref{fig:ind_examples_full} for additional examples with demeaning language and physical safety threat rating tasks.
  • Figure 2: Framework applied to the "toxicity" task (Civil Comments). Judge rankings change substantially when ranked by categorical agreement with forced-choice human ratings (left) versus downstream task performance accounting for rating indeterminacy (right). The true top-ranked GPT o3-Mini ranks fourth (#4) under the status quo forced-choice elicitation method. The top-ranked judge under forced-choice elicitation, Claude Sonnet 3.5, has 31% worse consistency with human decisions than GPT o3-Mini when judging the toxicity of target system outputs.
  • Figure 3: Our framework connects two key meta-evaluation goals: measuring general-purpose human--judge agreement metrics (left), and validating judge systems on specific downstream evaluation tasks (right). Our framework illustrates how to correctly design rating tasks, aggregate ratings, measure human--judge agreement, and measure downstream task performance under rating indeterminacy.
  • Figure 4: An illustration of our rating model applied to an item in an underspecified ($\S$\ref{subsec:agreement_limitations}) Yes/No rating task. The response set distribution ($\boldsymbol{\theta}^*_i$) and forced-choice distribution ($\mathbf{O}_i$) denote probability vectors over response sets and forced-choice options, respectively. Left: $\mathbf{O}_i$ is recovered by multiplying $\boldsymbol{\theta}^*_i$ by the forced-choice translation matrix ($\mathbf{F}_i$). Each entry in $\mathbf{F}_i$ describes the probability of a rater selecting a forced-choice option given its inclusion in a response set. Right: The multi-label vector ($\mathbf{\Omega}_i$) is recovered by multiplying the response set distribution by an option lookup table ($\mathbf{\Lambda}$). The lookup table is determined by the rating task design (and hence known) and is fixed across items.
  • Figure 5: Sub-optimality from selecting a judge system via human--judge agreement metrics vs. directly on the downstream task metric. Results aggregated across 11 tasks, 9 systems, and a sweep of $\tau$ values. As $\beta^H$ increases, performance gaps widen and forced-choice metrics become less reliable.
  • ...and 34 more figures
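
The rating model described in the Figure 4 caption can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: it assumes a Yes/No task with response sets {Yes}, {No}, and {Yes, No}, and every number in it (the response set distributions and the tie-breaking probability `p_yes_given_both`) is invented for the example. The function names are likewise hypothetical. It computes the forced-choice distribution as $\mathbf{F}_i \boldsymbol{\theta}^*_i$ and the multi-label vector as $\mathbf{\Lambda} \boldsymbol{\theta}^*_i$, then shows two different response set distributions that produce the same forced-choice distribution but different multi-label vectors.

```python
# Illustrative sketch of the rating model from the Figure 4 caption.
# All numeric values are made up; names are hypothetical, not from the paper's code.

OPTIONS = ["Yes", "No"]
RESPONSE_SETS = [("Yes",), ("No",), ("Yes", "No")]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lookup_table(options, response_sets):
    """Lambda: entry is 1 if the option belongs to the response set, else 0."""
    return [[1.0 if opt in rs else 0.0 for rs in response_sets] for opt in options]


def translation_matrix(p_yes_given_both):
    """F_i: probability of picking each option given the rater's response set.

    Singleton sets force their only option; for {Yes, No} we assume the rater
    picks "Yes" with probability p_yes_given_both (an illustrative assumption).
    """
    return [
        [1.0, 0.0, p_yes_given_both],        # row for option "Yes"
        [0.0, 1.0, 1.0 - p_yes_given_both],  # row for option "No"
    ]


LAMBDA = lookup_table(OPTIONS, RESPONSE_SETS)
F = translation_matrix(0.5)

# Two different (assumed) response set distributions over {Yes}, {No}, {Yes, No}.
theta_a = [0.5, 0.2, 0.3]
theta_b = [0.6, 0.3, 0.1]

O_a, O_b = matvec(F, theta_a), matvec(F, theta_b)
omega_a, omega_b = matvec(LAMBDA, theta_a), matvec(LAMBDA, theta_b)

print("forced-choice:", O_a, O_b)         # both approximately [0.65, 0.35]
print("multi-label:  ", omega_a, omega_b)  # approximately [0.8, 0.5] vs [0.7, 0.4]
```

In this toy setting the forced-choice data cannot distinguish `theta_a` from `theta_b`, even though the two imply different probabilities that raters consider each option acceptable. This is only one hand-picked instance and does not stand in for the formal statement or proof of Theorem 3.1.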

Theorems & Definitions (11)

  • Theorem 3.1
  • Theorem C.3
  • Theorem C.4: Response Set Identifiability
  • Definition C.5: Rank Consistency
  • Definition C.6: Monotone Transformation
  • Theorem C.7
  • proof
  • proof
  • proof
  • Lemma C.8: Rank Consistency of MSE (srs/srs) Under Rater Error
  • ...and 1 more