Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge
Yassir Fathullah, Mark J. F. Gales
TL;DR
This work addresses the evaluation of LLM-generated text with limited labelled data by framing LLM-as-a-judge within a generalised probabilistic model. It generalises the Product-of-Experts (PoE) approach to comparing candidates, introduces generalised comparative experts, and combines absolute with comparative scoring in a unified framework that includes a learnable home-advantage term. Key contributions include Beta and Gaussian extensions for the pairwise likelihood, two targeted uncertainty measures (pairwise and ranking-level) with a Laplace-based posterior, and a probability-of-reordering criterion for efficient iterative selection of comparisons. Empirical results on SummEval and HANNA show that uncertainty estimation largely drives the efficiency improvements (roughly 50% fewer comparisons) and that ranking-level uncertainty helps identify low-performing predictions, while the exact choice of expert model has limited impact when uncertainty estimation is properly exploited.
Abstract
This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are special cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings, but our proposed uncertainty estimates, especially the probability of reordering, significantly improve system efficiency, reducing the number of comparisons needed by ~50%. In addition, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.
