Efficient Bayesian Inference from Noisy Pairwise Comparisons
Till Aczel, Lucas Theis, Roger Wattenhofer
TL;DR
BBQ addresses the challenge of aggregating noisy pairwise comparisons for evaluating generative models. It introduces a Bayesian Bradley-Terry (BT) model that jointly estimates item skills $\lambda_i$ and rater qualities $q_r$ via an EM algorithm with a Thurstonian latent-variable interpretation. The approach yields faster convergence, calibrated uncertainty estimates, and robust, interpretable rankings in crowdsourced settings, outperforming baseline BT methods. The method generalizes to other domains requiring reliable aggregation of noisy pairwise judgments, enabling cost-effective human evaluation of AI systems.
Abstract
Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
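To make the setting concrete, the following is a minimal Python sketch of the kind of baseline the abstract contrasts against: a plain Bradley-Terry fit that ignores rater variability. It is not the BBQ algorithm. The function names (`bt_win_prob`, `noisy_win_prob`, `fit_bt_mm`) are hypothetical, the fit uses the classical minorization-maximization (MM) update rather than the EM procedure described here, and the rater-quality mixture in `noisy_win_prob` (a rater follows the Bradley-Terry model with probability `q_r` and otherwise guesses uniformly) is an illustrative assumption, not the model specified by the authors.

```python
import numpy as np


def bt_win_prob(lam_i, lam_j):
    """Standard Bradley-Terry probability that item i beats item j."""
    return lam_i / (lam_i + lam_j)


def noisy_win_prob(lam_i, lam_j, q_r):
    """ASSUMPTION (illustrative only, not taken from the paper):
    a rater with quality q_r in [0, 1] answers according to the
    Bradley-Terry model with probability q_r and flips a fair coin
    otherwise."""
    return q_r * bt_win_prob(lam_i, lam_j) + (1.0 - q_r) * 0.5


def fit_bt_mm(wins, n_iter=200):
    """Baseline Bradley-Terry fit via classical MM updates.

    wins[i, j] is the number of times item i was preferred over item j.
    Rater identity is ignored. Assumes every item wins at least once,
    so no skill estimate collapses to zero.
    """
    wins = np.asarray(wins, dtype=float)
    n_items = wins.shape[0]
    lam = np.ones(n_items)
    pair_counts = wins + wins.T      # comparisons made for each pair
    total_wins = wins.sum(axis=1)    # wins per item
    for _ in range(n_iter):
        denom = (pair_counts / (lam[:, None] + lam[None, :])).sum(axis=1)
        lam = total_wins / denom
        lam /= lam.sum()             # fix the scale; BT skills are scale-free
    return lam


if __name__ == "__main__":
    # Toy round-robin: 3 items, 10 comparisons per pair (made-up counts).
    toy_wins = [[0, 8, 9],
                [2, 0, 7],
                [1, 3, 0]]
    print(fit_bt_mm(toy_wins))
```

Under the assumed mixture, a rater with `q_r = 0` contributes pure coin flips and carries no information about the skills, which is one simple way to see why an estimator that models rater quality can downweight such participants.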
