Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

Benjamin Feuer; Lucas Rosenblatt; Oussama Elachqar

TL;DR

The paper proposes average bias-boundedness (A-BB), an algorithmic framework that formally guarantees a reduction in the harm or impact caused by any measurable bias in an LLM judge.

Abstract

As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve $(\tau = 0.5, \delta = 0.01)$ bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.

Paper Structure (29 sections, 9 theorems, 46 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Contributions
Bias-Bounded Evaluation
Basic Definitions
Judgment Context and Neighboring Contexts
Average Bias-Bounded Evaluation (A-BB)
Defining A-BB
Lipschitz shrinkage of data
Experiments
Controlling formatting sensitivity bias.
Controlling schematic bias.
Related Work
Formal guarantees and uncertainty quantification.
Scoring bias and agreeableness.
Limitations
...and 14 more sections

Key Result

Theorem 3.3

Consider a judgment context $D$ and a neighbor generator $T$. Let $\Delta := f(D) - f(D')$ where $D' \underset{T}{\sim} D$. Let $M_\sigma(D) = f(D) + Z$ with $Z \sim \mathcal{N}(0, \sigma^2 I_d)$, and further let $B := Z - Z'$, where $Z'$ is an independent copy of $Z$, so $B \sim \mathcal{N}(0, 2\sigma^2 I_d)$. ... or, equivalently, for any $\sigma$ in the admissible interval $0 < \sigma \leq \sigma_{\max}$ where ...
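The mechanism named in the theorem, $M_\sigma(D) = f(D) + Z$ with $Z \sim \mathcal{N}(0, \sigma^2 I_d)$, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gaussian_mechanism`, the toy score vector, and the value of `sigma` are all assumptions made for illustration, and the admissible bound $\sigma_{\max}$ from the theorem is not reproduced here. The second half checks empirically the standard fact that the difference of two independent copies of the noise has per-coordinate variance $2\sigma^2$.

```python
import random
import statistics


def gaussian_mechanism(scores, sigma, rng):
    """Return scores + Z, drawing one independent N(0, sigma^2) value per coordinate."""
    return [s + rng.gauss(0.0, sigma) for s in scores]


rng = random.Random(0)          # seeded for reproducibility
sigma = 0.5                     # hypothetical value, NOT derived from the theorem
scores = [3.0, 5.0, 4.0, 2.0]   # toy stand-in for the judge output f(D)

released = gaussian_mechanism(scores, sigma, rng)

# Empirical check: B = Z - Z' (independent copies) has variance 2 * sigma^2.
diffs = [rng.gauss(0.0, sigma) - rng.gauss(0.0, sigma) for _ in range(100_000)]
var_b = statistics.pvariance(diffs)
```

In a real use, `sigma` would have to be chosen from the admissible interval given by the theorem rather than fixed by hand as above.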

Figures (4)

Figure 1: Bias-bounded evaluation constrains the impact of harms in judge scoring. This before-and-after visualization of the score distributions from an LLM judge (Likert-scale) on the popular Arena-Hard-Auto benchmark shows how true uncertainty can be captured via a compacted score distribution. After the average bias-boundedness (A-BB) algorithm is applied, the original integer-valued scores are transformed into a debiased, continuous trajectory which accurately represents the measured uncertainty of the evaluation. The plot shows a KDE density map of the score distribution before and after transformation, with a conservative $\tau = 0.5, \delta = 0.03, \texttt{dim}=500$, averaged across a panel of four judges.
Figure 2: Bias-bounded transformation for formatting sensitivity. The blue line in this figure corresponds to the debiased ranking generated by a QwQ-32B judge after using BBE with $\tau=0.5$ to control formatting sensitivity. Even with low $\tau$ tolerance, we are able to retain 88% correlation with the original judgments in this realistic perturbation setting.
Figure 3: Bias-bounded evaluation in schematic sensitivity. Even when measured bias is large, we are able to eliminate much potentially biased variance while retaining near-perfect correlation with the original judgments ($\tau=0.5$).
Figure 4: Correlative strength varies by judge and by dataset. Although conservative aggregation strategies are always more difficult to debias, and simpler biases such as formatting are consistently easier to debias, other factors, such as the underlying dataset, can also have large effects.
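The figure captions above report how well rankings after the transformation correlate with the original judgments. As a rough sketch of how such a comparison might be computed, the snippet below calculates a Spearman rank correlation between two score lists. This page does not say which correlation coefficient the paper uses, so Spearman is an assumption here, as are the helper names `ranks` and `spearman` and the toy numbers, which are not results from the paper.

```python
def ranks(values):
    """Rank positions (0 = smallest). Assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman rank correlation using the no-ties closed form."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy data only: per-model scores before and after a hypothetical transformation.
original = [7.1, 6.4, 5.9, 5.2, 4.8]
transformed = [6.0, 6.1, 5.5, 5.3, 5.0]

rho = spearman(original, transformed)  # the top two models swap places here
```

With tied scores, such as raw integer Likert ratings, the closed form above no longer applies and average ranks would be needed instead.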

Theorems & Definitions (28)

Definition 2.1: Judgment Space
Definition 2.2: Rubric Factors
Definition 2.3: Bias Space
Definition 2.4: Judgment Context
Definition 2.5: Neighboring Judgment Contexts
Definition 3.1: Root-mean-squared sensitivity
Definition 3.2: Average bias boundedness (A-BB)
Theorem 3.3: Gaussian mech. for A-BB: a baseline split bound
Corollary 3.3: Splitting the failure budget
Corollary 3.4: Symmetric split
...and 18 more
