
Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

Benjamin Feuer, Lucas Rosenblatt, Oussama Elachqar

TL;DR

We propose average bias-boundedness (A-BB), an algorithmic framework that formally guarantees reductions in the harm/impact caused by any measurable bias in an LLM judge.

Abstract

As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.

Paper Structure

This paper contains 29 sections, 9 theorems, 46 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.3

Consider a judgment context $D$ and a neighbor generator $T$. Let $\Delta := f(D) - f(D')$, where $D' \underset{T}{\sim} D$. Let $M_\sigma(D) = f(D) + Z$ with $Z \sim \mathcal{N}(0, \sigma^2 I_d)$, and let $B := Z - Z'$, where $Z'$ is an independent copy of $Z$, so $B \sim \mathcal{N}(0, 2\sigma^2 I_d)$. [...] or, equivalently, for any $\sigma$ in the admissible interval $0 < \sigma \leq \sigma_{\max}$, where [...]
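
The Gaussian mechanism in the theorem statement can be illustrated with a minimal sketch. It assumes that $f(D)$ is a plain vector of judge scores; the score values, the noise scale `sigma`, and the function names are hypothetical and are not taken from the paper or its repository. The sketch adds $Z \sim \mathcal{N}(0, \sigma^2 I_d)$ to the scores and checks empirically that the difference of two independent noise draws, $B = Z - Z'$, has per-coordinate variance close to $2\sigma^2$. It does not compute $\sigma_{\max}$, whose definition is not shown in this excerpt.

```python
import random


def gaussian_mechanism(scores, sigma, rng):
    """Return M_sigma(D) = f(D) + Z, with Z ~ N(0, sigma^2 I_d)."""
    return [s + rng.gauss(0.0, sigma) for s in scores]


def noise_difference_variance(sigma, n_samples, rng):
    """Empirical variance of one coordinate of B = Z - Z' (Z, Z' independent)."""
    diffs = [rng.gauss(0.0, sigma) - rng.gauss(0.0, sigma) for _ in range(n_samples)]
    mean = sum(diffs) / n_samples
    return sum((d - mean) ** 2 for d in diffs) / n_samples


rng = random.Random(0)               # fixed seed for reproducibility
sigma = 0.5                          # illustrative noise scale (assumption)
judge_scores = [3.0, 4.0, 2.0, 5.0]  # hypothetical Likert scores playing the role of f(D)

noisy_scores = gaussian_mechanism(judge_scores, sigma, rng)
empirical_var = noise_difference_variance(sigma, 80_000, rng)
theoretical_var = 2 * sigma ** 2     # variance of each coordinate of B
```

In this sketch `empirical_var` should land close to `theoretical_var`, which is why the analysis of neighboring outputs works with noise of covariance $2\sigma^2 I_d$ rather than $\sigma^2 I_d$.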

Figures (4)

  • Figure 1: Bias-bounded evaluation constrains the impact of harms in judge scoring. This before-and-after visualization of the score distributions from an LLM judge (Likert-scale) on the popular Arena-Hard-Auto benchmark shows how true uncertainty can be captured via a compacted score distribution. After the average bias-boundedness (A-BB) algorithm is applied, the original integer-valued scores are transformed into a debiased, continuous trajectory which accurately represents the measured uncertainty of the evaluation. The plot shows a KDE density map of the score distribution before and after transformation, with a conservative $\tau = 0.5, \delta = 0.03, \texttt{dim}=500$, averaged across a panel of four judges.
  • Figure 2: Bias-bounded transformation for formatting sensitivity. The blue line in this figure corresponds to the debiased ranking generated by a QwQ-32B judge after using BBE with $\tau=0.5$ to control formatting sensitivity. Even with low $\tau$ tolerance, we are able to retain 88% correlation with the original judgments in this realistic perturbation setting.
  • Figure 3: Bias-bounded evaluation in schematic sensitivity. Even when measured bias is large, we are able to eliminate much of the potentially biased variance while retaining near-perfect correlation with the original judgments ($\tau=0.5$).
  • Figure 4: Correlative strength varies by judge and by dataset. Although conservative aggregation strategies are always more difficult to debias, and simpler biases such as formatting are consistently easier to debias, other factors, such as the underlying dataset, can also have large effects.

Theorems & Definitions (28)

  • Definition 2.1: Judgment Space
  • Definition 2.2: Rubric Factors
  • Definition 2.3: Bias Space
  • Definition 2.4: Judgment Context
  • Definition 2.5: Neighboring Judgment Contexts
  • Definition 3.1: Root-mean-squared sensitivity
  • Definition 3.2: Average bias boundedness (A-BB)
  • Theorem 3.3: Gaussian mech. for A-BB: a baseline split bound
  • Corollary 3.3: Splitting the failure budget
  • Corollary 3.4: Symmetric split
  • ...and 18 more