Validating LLM-as-a-Judge Systems under Rating Indeterminacy

Luke Guerdan; Solon Barocas; Kenneth Holstein; Hanna Wallach; Zhiwei Steven Wu; Alexandra Chouldechova

TL;DR

The paper addresses rating indeterminacy in the meta-evaluation of LLM-as-a-judge systems and shows that forced-choice elicitation can misrank judge systems. It develops a probabilistic framework linking forced-choice and response-set data, and advocates multi-label agreement metrics such as mean-squared error (MSE) over response sets, with two variants: MSE $\mathbf{F}$ (computed with an oracle $\mathbf{F}$) and MSE $\hat{\mathbf{F}}$ (computed with an estimate of $\mathbf{F}$ from a small auxiliary corpus). In experiments on 11 real-world rating tasks and 9 LLMs, judge systems selected by standard forced-choice validation perform as much as 31% worse than those selected by methods that account for rating indeterminacy, while fully specified tasks or multi-label agreement metrics lead to more robust judge selection and downstream performance. The work provides concrete guidelines for principled meta-evaluation, highlights rank-consistency concerns, and makes code and data publicly available to support reproducibility and adoption in practice.

Abstract

The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, plays a critical role in scaling and standardizing GenAI evaluations. To validate such judge systems, evaluators assess human--judge agreement by first collecting multiple human ratings for each item in a validation corpus, then aggregating the ratings into a single, per-item gold label rating. For many items, however, rating criteria may admit multiple valid interpretations, so a human or LLM rater may deem multiple ratings "reasonable" or "correct." We call this condition rating indeterminacy. Problematically, many rating tasks that contain rating indeterminacy rely on forced-choice elicitation, whereby raters are instructed to select only one rating for each item. In this paper, we introduce a framework for validating LLM-as-a-judge systems under rating indeterminacy. We draw theoretical connections between different measures of judge system performance under different human--judge agreement metrics, and different rating elicitation and aggregation schemes. We demonstrate that differences in how humans and LLMs resolve rating indeterminacy when responding to forced-choice rating instructions can heavily bias LLM-as-a-judge validation. Through extensive experiments involving 11 real-world rating tasks and 9 commercial LLMs, we show that standard validation approaches that rely upon forced-choice ratings select judge systems that are highly suboptimal, performing as much as 31% worse than judge systems selected by our approach that uses multi-label "response set" ratings to account for rating indeterminacy. We conclude with concrete recommendations for more principled approaches to LLM-as-a-judge validation.
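The contrast between forced-choice and response-set validation can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the data are invented, the function names `forced_choice_hit_rate` and `multilabel_mse` are hypothetical, and the MSE shown is a generic per-option squared error over endorsement rates, not necessarily the exact metric defined in the paper.

```python
import numpy as np


def forced_choice_hit_rate(human_choices, judge_choices):
    """Fraction of items where the judge's single rating equals the human gold label."""
    human = np.asarray(human_choices)
    judge = np.asarray(judge_choices)
    return float(np.mean(human == judge))


def multilabel_mse(human_sets, judge_sets):
    """Generic MSE between per-item, per-option endorsement rates.

    Both inputs are binary arrays of shape (items, raters, options), where a 1
    means the rater included that option in their response set.
    """
    human_rate = np.asarray(human_sets, dtype=float).mean(axis=1)
    judge_rate = np.asarray(judge_sets, dtype=float).mean(axis=1)
    return float(np.mean((human_rate - judge_rate) ** 2))


# Hypothetical corpus: 2 items, 2 rating options, 2 human raters.
# Item 0 is indeterminate (both options are deemed reasonable); item 1 is not.
human_sets = [[[1, 1], [1, 1]],
              [[1, 0], [1, 0]]]
# Under forced choice, the humans happen to resolve item 0 toward option 0.
human_forced = [0, 0]

# Judge A shares the human response sets but resolves item 0 toward option 1.
judge_a_sets = [[[1, 1]], [[1, 0]]]
judge_a_forced = [1, 0]

# Judge B treats item 0 as having a single valid rating (option 0).
judge_b_sets = [[[1, 0]], [[1, 0]]]
judge_b_forced = [0, 0]

hit_a = forced_choice_hit_rate(human_forced, judge_a_forced)
hit_b = forced_choice_hit_rate(human_forced, judge_b_forced)
mse_a = multilabel_mse(human_sets, judge_a_sets)
mse_b = multilabel_mse(human_sets, judge_b_sets)

print(f"forced-choice hit rate: A={hit_a:.2f}, B={hit_b:.2f}")
print(f"response-set MSE:       A={mse_a:.2f}, B={mse_b:.2f}")
```

In this invented example, forced-choice agreement favors judge B, while the response-set metric favors judge A, whose view of which ratings are reasonable matches the human raters exactly. The two judges differ only in how they resolve the indeterminate item when forced to pick one rating.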
