UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge

Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang

TL;DR

The paper tackles preference bias in pairwise LLM evaluation and shows substantial disagreement between judges. It proposes UDA (Unsupervised Debiasing Alignment), which replaces static Elo updates with an instance-level, learned correction: an Adaptive Debiasing Network outputs a dynamic K-factor and soft win-labels, guided by a Consensus Anchor without human labels. By minimizing the dispersion among judges' Elo trajectories and aligning them toward a collective consensus, UDA reduces the inter-judge rating standard deviation by up to 63.4% and improves average correlation with human judgments by 24.7%, while enabling zero-shot transfer to unseen tasks. The approach is theoretically motivated as variance reduction via shrinkage toward the consensus and is validated through experiments on ArenaHard and a Human-Annotated Transfer Set, with ablations showing the essential role of self-awareness features. The work offers a scalable, model-agnostic path to more robust, reproducible LLM evaluation and releases code and data.

Abstract

Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at https://anonymous.4open.science/r/62AB93CD-23B4.
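The difference between a static Elo update and the per-comparison adjustment described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the adaptive K-factor (16) and soft win-label (0.7) are hand-picked stand-ins for values that, in UDA, a learned network would produce.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of model A against model B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, win_label: float, k: float):
    """Apply one Elo update.

    win_label is A's outcome in [0, 1]: 1.0 for a hard win, or a soft
    value when the judge's verdict is only partially trusted.
    k is the step size; classic Elo keeps it fixed for every comparison.
    """
    delta = k * (win_label - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Static Elo: fixed K = 32 and a hard win label for A.
static = elo_update(1000.0, 1000.0, win_label=1.0, k=32.0)

# Adaptive update: K and the win label are set per comparison.
# The values below are illustrative assumptions, not network outputs.
adaptive = elo_update(1000.0, 1000.0, win_label=0.7, k=16.0)

print(static)    # A gains 16 points
print(adaptive)  # A gains about 3.2 points
```

With a smaller K and a softened label, a single possibly biased verdict moves the ratings far less than it would under the static rule.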

Paper Structure

This paper contains 47 sections, 2 theorems, 17 equations, 5 figures, 6 tables, 2 algorithms.

Key Result

Theorem 3.1

Let $R_i^*$ be the unknown true Elo score for model $i$. Let $R_i^{(k)}$ be the score assigned by judge $k$, which includes a bias term $\epsilon_i^{(k)} = R_i^{(k)} - R_i^*$. The UDA procedure, by optimizing each judge's score towards the consensus, is motivated by the principle that this reduces t
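The notation above can be made concrete with a small numeric sketch. It does not reproduce the theorem, whose statement is truncated here; it only illustrates a standard identity that fits the stated motivation: the mean squared judge bias equals the squared bias of the consensus (mean) score plus the inter-judge variance. All numbers are hypothetical, and the true score is assumed known purely for illustration, since in practice it is unobserved.

```python
import statistics

# Hypothetical true Elo score R_i^* for one model (unknown in practice).
true_rating = 1200.0

# Hypothetical scores R_i^(k) assigned to that model by four judges.
judge_ratings = [1150.0, 1230.0, 1260.0, 1180.0]

# Per-judge bias terms: epsilon_i^(k) = R_i^(k) - R_i^*.
biases = [r - true_rating for r in judge_ratings]

# Consensus score (simple mean across judges) and its bias.
consensus = statistics.fmean(judge_ratings)
consensus_bias = consensus - true_rating

# Average squared bias of individual judges, and inter-judge dispersion.
mean_sq_bias = statistics.fmean([b * b for b in biases])
dispersion = statistics.pstdev(judge_ratings)

print(f"mean squared judge bias: {mean_sq_bias:.1f}")
print(f"squared consensus bias:  {consensus_bias ** 2:.1f}")
print(f"inter-judge std:         {dispersion:.2f}")
```

In this toy setting the judges' errors partly cancel, so the consensus is much closer to the true score than a typical judge. A bias shared by every judge would not cancel this way.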

Figures (5)

  • Figure 1: A. Some models prefer their own answers; mitigating this bias across different LLM judges can make the results more accurate. B. UDA dynamically recalibrates, rather than naively adopts, LLM-judged scores, using consensus among judges as the supervisory label in lieu of human annotation.
  • Figure 2: Pairwise answer evaluation experiments conducted with different LLMs as judges on the same ArenaHard dataset. The figure shows that LLMs exhibit self-preferential bias: some over-rate their own answers relative to other judges, while others under-rate them. On ArenaHard (the depicted dataset) this bias ranges from $-$38% to +90%, while on our dataset it spans $-$21% to +56%.
  • Figure 3: Score stability across ten judge LLMs. Left: Baseline Elo. Right: UDA. Our method markedly aligns scores across diverse LLM judges, yielding significantly lower inter-judge variance.
  • Figure 4: Per-model Pearson correlation ($\uparrow$) between the human-annotated ground-truth scores and the judge scores computed with the baseline method, the UDA method, and the consensus result, respectively. Notably, consensus scores correlate best with human judgments; the uniform correlation of 0.89 arises because the consensus is shared across all judges, irrespective of individual model participation. This validates the consensus as a robust optimization target.
  • Figure 5: Judge-score heatmaps on the Human-Annotated Transfer Set. After refinement with UDA, the scores assigned by different LLM judges converge markedly, yielding visibly narrower color variation within each column of the heatmap.

Theorems & Definitions (4)

  • Theorem 3.1: Principle of Aggregate Bias Reduction
  • proof: Proof Sketch (Illustrative)
  • Theorem A.1
  • proof