Score-Based Density Estimation from Pairwise Comparisons

Petrus Mikkola, Luigi Acerbi, Arto Klami

TL;DR

The paper tackles learning a target density from pairwise comparisons by linking it to a tempered marginal winner density (MWD) via a position-dependent tempering field, which enables score-based estimation. It proves that under the Bradley–Terry (BT) and exponential random utility models (RUMs) the scores satisfy $\nabla \log p(\boldsymbol{x}) = \tau(\boldsymbol{x}) \nabla \log p_w(\boldsymbol{x})$, and develops a practical diffusion-based method that recovers $p$ by estimating $p_w$ and the tempering field $\tau$, then sampling from $p$ with score-scaled annealed Langevin dynamics (ALD). The approach trains on the joint density of (winner, loser) pairs together with the marginal to obtain the MWD score, estimates the tempering field through a learned density ratio under BT, and reports improved accuracy over prior flow-based methods on synthetic targets and on an LLM expert elicitation experiment used as a real-data proxy, using hundreds to thousands of pairwise queries. The work advances expert-knowledge elicitation and potential fine-tuning of generative systems by providing a principled, score-based mechanism to represent and sample from human beliefs. Overall, it delivers a theoretically grounded, practically effective framework for density estimation from limited preference data.

Abstract

We study density estimation from pairwise comparisons, motivated by expert knowledge elicitation and learning from human feedback. We relate the unobserved target density to a tempered winner density (marginal density of preferred choices), learning the winner's score via score-matching. This allows estimating the target by 'de-tempering' the estimated winner density's score. We prove that the score vectors of the belief and the winner density are collinear, linked by a position-dependent tempering field. We give analytical formulas for this field and propose an estimator for it under the Bradley-Terry model. Using a diffusion model trained on tempered samples generated via score-scaled annealed Langevin dynamics, we can learn complex multivariate belief densities of simulated experts, from only hundreds to thousands of pairwise comparisons.
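The 'de-tempering' step can be illustrated with a minimal sketch of score-scaled annealed Langevin dynamics. This is not the authors' implementation: the function name `score_scaled_ald`, the step-size schedule, and the toy setup are assumptions made for illustration. The toy uses a winner density $p_w = \mathcal{N}(0, 2)$ with a constant tempering field $\tau = 2$, so that $\tau \nabla \log p_w$ equals the score of $p = \mathcal{N}(0, 1)$; in the paper both the score and the field are learned and the field varies with position.

```python
import numpy as np

def score_scaled_ald(score_fn, tau_fn, x0, sigmas, eps=0.02, n_steps=200, rng=None):
    """Annealed Langevin dynamics whose drift is tau(x) * score(x, sigma).

    score_fn(x, sigma): score of the noise-perturbed winner density (assumed given).
    tau_fn(x): tempering field estimate (assumed given).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for sigma in sigmas:
        # Step size shrinks with the noise level (a common ALD schedule, assumed here).
        alpha = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(n_steps):
            drift = tau_fn(x) * score_fn(x, sigma)
            noise = rng.standard_normal(x.shape)
            x = x + 0.5 * alpha * drift + np.sqrt(alpha) * noise
    return x

# Toy assumption: p_w = N(0, 2) perturbed by N(0, sigma^2) noise has variance 2 + sigma^2.
def toy_mwd_score(x, sigma):
    return -x / (2.0 + sigma ** 2)

# Toy assumption: constant tempering field tau = 2.
def toy_tau(x):
    return np.full_like(x, 2.0)

rng = np.random.default_rng(0)
x0 = rng.uniform(-3.0, 3.0, size=5000)   # 5000 independent chains
sigmas = np.geomspace(1.0, 0.1, 5)       # decreasing noise levels
samples = score_scaled_ald(toy_mwd_score, toy_tau, x0, sigmas, rng=rng)
print(samples.mean(), samples.var())
```

In this toy setting the sample variance should land near 1 (the variance of the target) instead of 2 (the variance of the winner density), which is the effect the scaling by $\tau$ is meant to achieve.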

Paper Structure

This paper contains 38 sections, 7 theorems, 58 equations, 16 figures, 2 tables, 2 algorithms.

Key Result

Theorem 3.1

Assume $W \sim \text{Gumbel}(0,s)$. A tempering field $\tau(\mathbf{x})$ exists between the belief density $p$ and the MWD $p_w$, and it is given by the formula, where $r_s(\mathbf{x},\mathbf{x}') := p^{\frac{1}{s}}(\mathbf{x}')p^{-\frac{1}{s}}(\mathbf{x})$ is the $1/s$-tempered density ratio.
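As a sketch of why a field of this kind arises, and not a restatement of the theorem's formula, the following derivation works under explicit assumptions that are not stated in the text above: the utility is $\log p$, pairs are drawn independently from a sampling density $\lambda$ that is constant in a neighbourhood of $\mathbf{x}$ (for example uniform), and the MWD is $p_w(\mathbf{x}) \propto \lambda(\mathbf{x}) \int \lambda(\mathbf{x}') P(\mathbf{x} \succ \mathbf{x}')\, d\mathbf{x}'$. The paper's own statement may be written differently or cover a more general $\lambda$.

```latex
% Gumbel(0, s) noise on both utilities gives a logistic difference, hence
P(\mathbf{x} \succ \mathbf{x}')
  = \frac{p^{1/s}(\mathbf{x})}{p^{1/s}(\mathbf{x}) + p^{1/s}(\mathbf{x}')}
  = \frac{1}{1 + r_s(\mathbf{x}, \mathbf{x}')} .

% Differentiating r_s and the winning probability with respect to x:
\nabla_{\mathbf{x}} r_s(\mathbf{x}, \mathbf{x}')
  = -\tfrac{1}{s}\, r_s(\mathbf{x}, \mathbf{x}')\, \nabla \log p(\mathbf{x}),
\qquad
\nabla_{\mathbf{x}} \frac{1}{1 + r_s}
  = \frac{1}{s}\, \frac{r_s}{(1 + r_s)^2}\, \nabla \log p(\mathbf{x}) .

% With \lambda locally constant, the score of the MWD is therefore
\nabla \log p_w(\mathbf{x})
  = \frac{1}{s}\,
    \frac{\int \lambda(\mathbf{x}')\, \frac{r_s}{(1 + r_s)^2}\, d\mathbf{x}'}
         {\int \lambda(\mathbf{x}')\, \frac{1}{1 + r_s}\, d\mathbf{x}'}\;
    \nabla \log p(\mathbf{x}),

% so the two scores are collinear, with
\tau(\mathbf{x})
  = s\,
    \frac{\int \lambda(\mathbf{x}')\, \frac{1}{1 + r_s(\mathbf{x}, \mathbf{x}')}\, d\mathbf{x}'}
         {\int \lambda(\mathbf{x}')\, \frac{r_s(\mathbf{x}, \mathbf{x}')}{(1 + r_s(\mathbf{x}, \mathbf{x}'))^2}\, d\mathbf{x}'} .
```

Both integrands are positive, so under these assumptions $\tau(\mathbf{x}) > 0$ and the two scores point in the same direction.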

Figures (16)

  • Figure 1: (a) Problem setup. An expert holds a subjective belief over a parameter space, such as the likely hyperparameters of a learning algorithm (e.g. learning rate and weight decay), and can answer questions like "Do you expect configuration A or B to work better?". We learn their belief as a density, to be used e.g. as a prior distribution for finding optimal hyperparameters. (b)-(d) Density estimation from $200$ uniformly sampled pairwise comparisons, with the target density shown as a heatmap. (b) Samples and the score field at an intermediate noise level $\sigma$, for a diffusion model trained on the (winner, loser) pairs to model the marginal winner density (MWD). (c) Estimated tempering field. (d) Samples from the score-scaled annealed Langevin dynamics with the MWD score and a tempering field estimate. Samples align well with the target density, demonstrating the fundamental relationship between the scores of the estimable MWD and the latent target (belief density).
  • Figure 2: Illustration of the relationship $\nabla \log p(\mathbf{x}) = \tau(\mathbf{x}) \nabla \log p_w(\mathbf{x})$ when $p$ is Twomoons2D (Stimper et al., 2022) and $\lambda$ is uniform. (a) The score of $p$ (red arrows) and the score of $p_w$ (orange arrows) under the Bradley-Terry model, scaled for better visualization. (b) The estimated tempering field $\tau(\mathbf{x})$ from 200 pairwise comparisons (left, Section \ref{sec_temp_field_est}) and the ground truth (right, Theorem 3.1). Due to the collinearity of the scores, the red arrows equal the pointwise product of the orange arrows and the tempering field, which can be estimated (with an underestimation in this example).
  • Figure 3: (a-b) Samples from score-based and flow estimates of Ring2D, with contours indicating the true density. Ring2D illustrates an extreme case where the score-based method clearly outperforms the flow method: the flow model oversamples the center of the ring, where the MWD also has moderate density, whereas the score-based method can downweight it using the tempering field. (c) Cross-plot of the first three variables in the LLM expert elicitation experiment. Full cross-plot and comparison to the flow method are shown in Figs. \ref{fig-llmexp-full} and \ref{fig_llm_baseline}. The score-based method tends to generate Gaussian-like marginals in this extremely limited data setting.
  • Figure A.1: Illustration of the tempering fields under two different RUMs when $p$ is Twomoons2D (Stimper et al., 2022). The tempering field $\tau(\mathbf{x})$ of the exponential RUM (left, Theorem \ref{theorem_expRUM}) and the Bradley-Terry model (right, Theorem 3.1).
  • Figure C.1: Replication of Fig. 1 using (a) only winner samples, with the score model trained only for the MWD $p_w(\mathbf{x})$. The quality of the final density estimate, i.e., the 'tempered' MWD, is clearly inferior compared to (b) training on the full joint $p_{\mathbf{x} \succ \mathbf{x}'}(\mathbf{x}, \mathbf{x}')$ using both winners and losers.
  • ...and 11 more figures
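
The score relationship illustrated in Figure 2 can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions, not the paper's code: the utility is $\log p$, the sampling density $\lambda$ is uniform on a grid, comparisons follow the Bradley-Terry model with scale $s$, and the tempering field is computed as $\tau(x) = s\, \mathbb{E}_\lambda[1/(1+r_s)] \,/\, \mathbb{E}_\lambda[r_s/(1+r_s)^2]$, the form obtained by differentiating the winning probability under these assumptions. The function name `tempering_field_1d` is hypothetical.

```python
import numpy as np

def tempering_field_1d(log_p, grid, s=1.0):
    """Tempering field on a 1D grid, assuming uniform lambda on that grid.

    Returns (tau, pw) where pw is the marginal winner density up to a constant.
    """
    lp = log_p(grid)
    # r[i, j] = (p(x'_j) / p(x_i))^(1/s), the 1/s-tempered density ratio.
    r = np.exp((lp[None, :] - lp[:, None]) / s)
    win = 1.0 / (1.0 + r)                  # P(x_i is preferred over x'_j)
    pw = win.mean(axis=1)                  # proportional to p_w(x_i) for uniform lambda
    den = (r / (1.0 + r) ** 2).mean(axis=1)
    tau = s * pw / den
    return tau, pw

# Toy target (assumption): standard normal belief density, unnormalized.
grid = np.linspace(-3.0, 3.0, 1201)
tau, pw = tempering_field_1d(lambda x: -0.5 * x ** 2, grid, s=1.0)

score_pw = np.gradient(np.log(pw), grid)   # finite-difference score of the MWD
score_p = -grid                            # exact score of the standard normal

# Compare tau * score_pw with score_p on interior points, where central
# differences are used; the remaining gap is finite-difference error.
max_err = np.max(np.abs(tau * score_pw - score_p)[1:-1])
print(max_err)
```

The comparison skips the two boundary points because `np.gradient` falls back to one-sided differences there, which are less accurate than the central differences used in the interior.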

Theorems & Definitions (18)

  • Theorem 3.1
  • proof
  • Proposition 3.2
  • proof
  • Proposition 3.3
  • proof
  • Theorem A.1
  • proof : Proof Sketch
  • proof
  • proof
  • ...and 8 more