UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge
Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang
TL;DR
The paper tackles preference bias in pairwise LLM-as-a-judge evaluation and shows that different judges disagree substantially on the same comparisons. It proposes UDA (Unsupervised Debiasing Alignment), which replaces static Elo updates with an instance-level, learned correction: an Adaptive Debiasing Network outputs a dynamic K-factor and soft win-labels for each comparison, guided by a Consensus Anchor and requiring no human labels. By minimizing the dispersion among judges’ Elo trajectories and aligning them toward a collective consensus, UDA reduces the inter-judge rating standard deviation (by up to 63.4%) and improves the average correlation with human judgments (by 24.7%), while enabling zero-shot transfer to unseen tasks. The approach is theoretically motivated as variance reduction via shrinkage toward the consensus and is validated through experiments on ArenaHard and a Human-Annotated Transfer Set, with ablations showing the essential role of self-awareness features. The work offers a scalable, model-agnostic path to more robust, reproducible LLM evaluation and releases code and data for reproducibility.
Abstract
Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at https://anonymous.4open.science/r/62AB93CD-23B4.
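The mechanism described in the abstract can be illustrated with a minimal sketch. It assumes the standard Elo expected-score formula (base 10, scale 400) and takes the per-comparison K-factor and soft win-label as plain inputs; in UDA these would come from the learned network, which is not reproduced here. The function names and the choice of dispersion measure (per-model population standard deviation across judges, averaged over models) are illustrative assumptions, not the paper's exact objective.

```python
from statistics import pstdev


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, soft_label, k):
    """One Elo update for a single comparison.

    soft_label: win probability for A in [0, 1] (assumed to be supplied by
                a debiasing network; here it is just an argument).
    k:          per-comparison K-factor (likewise assumed to be supplied).
    """
    delta = k * (soft_label - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def inter_judge_dispersion(ratings_by_judge):
    """Mean over models of the std. dev. of that model's rating across judges.

    ratings_by_judge: {judge_name: {model_name: rating}} (hypothetical layout).
    """
    models = next(iter(ratings_by_judge.values())).keys()
    per_model = [
        pstdev([ratings[m] for ratings in ratings_by_judge.values()])
        for m in models
    ]
    return sum(per_model) / len(per_model)


if __name__ == "__main__":
    # A hard win with a static K moves both ratings by K/2 from a tie.
    print(elo_update(1000.0, 1000.0, soft_label=1.0, k=32.0))
    # A softened label and smaller K produce a smaller correction.
    print(elo_update(1000.0, 1000.0, soft_label=0.75, k=16.0))
    # Two judges that disagree on model "a" but agree on model "b".
    print(inter_judge_dispersion({
        "judge_1": {"a": 1010.0, "b": 1000.0},
        "judge_2": {"a": 990.0, "b": 1000.0},
    }))
```

In this reading, training would adjust the network that produces `k` and `soft_label` so that a quantity like `inter_judge_dispersion` shrinks over the judges' rating trajectories, which is the unsupervised signal the abstract refers to.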
