Efficient Bayesian Inference from Noisy Pairwise Comparisons

Till Aczel, Lucas Theis, Roger Wattenhofer

TL;DR

BBQ addresses the challenge of aggregating noisy pairwise comparisons for evaluating generative models. It introduces a Bayesian Bradley-Terry (BT) model that jointly estimates item skills $\lambda_i$ and rater qualities $q_r$ via an EM algorithm with a Thurstonian latent-variable interpretation. The approach yields faster convergence, calibrated uncertainty estimates, and robust, interpretable rankings in crowdsourced settings, outperforming baseline BT methods. The method generalizes to other domains that require reliable aggregation of noisy pairwise judgments, enabling cost-effective human evaluation of AI systems.

Abstract

Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
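
For intuition about the baseline that BBQ extends, the following is a minimal sketch of maximum-likelihood fitting for the plain Bradley-Terry model, in which item $i$ beats item $j$ with probability $\lambda_i / (\lambda_i + \lambda_j)$. It uses the classical minorization-maximization update, which is known to increase the likelihood monotonically. It is not the BBQ algorithm: rater qualities $q_r$ and priors are omitted entirely. The function name `fit_bradley_terry`, its parameters, and the toy data are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_items, n_iters=200, tol=1e-10):
    """Fit plain Bradley-Terry skills from (winner, loser) index pairs.

    Hypothetical helper for illustration; no rater model, no prior.
    """
    skills = [1.0] * n_items
    if not comparisons:
        return skills

    wins = [0.0] * n_items
    pair_counts = defaultdict(float)  # unordered pair -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[(min(winner, loser), max(winner, loser))] += 1.0

    for _ in range(n_iters):
        # MM update: lambda_i <- W_i / sum_j n_ij / (lambda_i + lambda_j)
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            share = n_ij / (skills[i] + skills[j])
            denom[i] += share
            denom[j] += share
        new_skills = [
            wins[i] / denom[i] if denom[i] > 0 else skills[i]
            for i in range(n_items)
        ]
        # Skills are only identified up to scale, so fix their mean to 1.
        total = sum(new_skills)
        new_skills = [s * n_items / total for s in new_skills]
        delta = max(abs(a - b) for a, b in zip(new_skills, skills))
        skills = new_skills
        if delta < tol:
            break
    return skills


# Toy data (assumed): item 0 beats item 1 three times and loses once.
toy = [(0, 1)] * 3 + [(1, 0)]
print(fit_bradley_terry(toy, n_items=2))
```

With only two items the estimated skill ratio matches the empirical win ratio, which makes the update easy to check by hand. A rater-aware model such as the one described above would additionally weight each comparison by how reliable its rater appears to be.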

Paper Structure

This paper contains 29 sections, 33 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Scaling behavior of Bradley-Terry variants (Crowd-BT [caron2012efficient], Bayes-BT [chen2013pairwise], BBQ (ours)) on the IHQ-all dataset. Left: Performance vs. number of raters. Right: Performance vs. number of comparisons per rater. Both Top-1 agreement and Kendall's $\tau$ improve noticeably with more raters or comparisons. While Top-1 agreement differentiates between models, Kendall's $\tau$ remains similar across models. Crowd-BT fails to converge with very few raters, highlighting the EM algorithm's advantage. Crowd-BT and BBQ perform similarly under sparse data, but BBQ outperforms Bayes-BT as the number of raters or comparisons grows.
  • Figure 2: Scatter plot of rater agreement with the final ranking (x-axis) versus the predicted rater quality (y-axis) for the IHQ datasets. Each point corresponds to an individual rater. Triangles denote the filtered dataset, and squares denote the unfiltered dataset.
  • Figure 3: Average computation time in seconds (log-scale) for three models measured on a single bootstrapped sample across eight datasets. While Crowd-BT can require substantial computation time on some datasets, BBQ consistently remains fast across all datasets.
  • Figure 4: Screenshot of the Mabyduck user study platform used for collecting pairwise comparisons. A reference image is shown on the left, and the rater selects between two compressed images on the right.
  • Figure 5: Four example images from the pre-screening process for raters. The first two are color blindness tests, where raters must identify the number displayed in each pattern. The last two are shape detection tests designed to evaluate sensitivity to low-contrast objects: one light gray shape on a white background and one dark gray shape on a black background.
  • ...and 1 more figure
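
The Figure 1 caption reports Kendall's $\tau$ and Top-1 agreement between rankings. As a rough sketch of how such metrics might be computed from two score vectors over the same items, the code below implements Kendall's $\tau$ (the tau-a variant, which counts tied pairs as neither concordant nor discordant) and a simple Top-1 check. The paper's exact definitions are not given here, so `top1_agreement` in particular is an assumed interpretation, and both function names are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / n_pairs


def top1_agreement(scores_a, scores_b):
    """Assumed definition: do both score lists pick the same best item?"""
    best_a = max(range(len(scores_a)), key=lambda i: scores_a[i])
    best_b = max(range(len(scores_b)), key=lambda i: scores_b[i])
    return best_a == best_b


# Toy scores (assumed): the two rankings swap the two strongest items.
reference = [1.0, 2.0, 3.0]
estimate = [1.0, 3.0, 2.0]
print(kendall_tau(reference, estimate), top1_agreement(reference, estimate))
```

In this toy case two of the three item pairs are ordered consistently, so the correlation stays positive even though the top item differs. That mirrors the caption's remark that Top-1 agreement can separate models whose Kendall's $\tau$ values look similar.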