What Does Preference Learning Recover from Pairwise Comparison Data?

Rattana Pukdee; Maria-Florina Balcan; Pradeep Ravikumar

TL;DR

This work reframes pairwise preference learning by starting from the observed triplet distribution and introducing the Conditional Preference Distribution (CPRD), which encodes the context-specific probability that one response is preferred over another. It derives precise conditions under which the CPRD is representable by a Bradley--Terry (BT) model, notably a positive--negative conditional independence (CI) structure, and interprets BT learning as a KL projection onto the BT family. The authors prove that both generative and discriminative BT objectives recover the CPRD when BT representability or CI holds; otherwise they converge to the closest BT approximation of the CPRD, which makes explicit what is actually learned. They further quantify sample efficiency through the pairwise margin and the connectivity of the comparison graph, giving finite-sample bounds that tie learning performance to these data properties. Experiments on synthetic data support the theory: amplifying the margin and increasing connectivity both improve recovery, and optimizing data collection can mitigate learning bottlenecks.
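
One way to read the KL-projection statement is through the standard identity that cross-entropy equals entropy plus KL divergence. The display below is a sketch under assumptions that this summary does not confirm: $\mathcal{X}$ and $\mathcal{Y}$ are discrete, the CPRD is $\omega_P(x,y,y') = P(x,y,y')/m_P(x,y,y')$ with pair mass $m_P(x,y,y') = P(x,y,y') + P(x,y',y)$, and the BT model uses the logistic link $\sigma$. The symbols $m_P$, $H$, and $\mathrm{Ber}$ are introduced here for illustration, and the paper's own decomposition (theorem 5.2) may be stated differently.

$$
-\mathbb{E}_{(x,y^+,y^-)\sim P}\big[\log \sigma\big(r(x,y^+) - r(x,y^-)\big)\big]
= \sum_{x,\{y,y'\}} m_P(x,y,y') \Big[ H\big(\omega_P(x,y,y')\big) + \mathrm{KL}\big(\mathrm{Ber}(\omega_P(x,y,y')) \,\big\|\, \mathrm{Ber}(\sigma(r(x,y) - r(x,y')))\big) \Big]
$$

Here the sum runs over contexts and unordered response pairs with $m_P > 0$, $H$ is the binary entropy, and $\mathrm{Ber}(p)$ is a Bernoulli distribution with success probability $p$. The entropy term does not depend on $r$, so minimizing the left-hand side over score functions is the same as minimizing a pair-mass-weighted KL divergence between the CPRD and the BT preference probabilities.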

Abstract

Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets $(x, y^+, y^-)$, where response $y^+$ is preferred over response $y^-$ for context $x$. The Bradley--Terry (BT) model is the predominant approach, modeling preference probabilities as a function of latent score differences. Standard practice assumes data follows this model and learns the latent scores accordingly. However, real data may violate this assumption, and it remains unclear what BT learning recovers in such cases. Starting from triplet comparison data, we formalize the preference information it encodes through the conditional preference distribution (CPRD). We give precise conditions for when BT is appropriate for modeling the CPRD, and identify factors governing sample efficiency -- namely, margin and connectivity. Together, these results offer a data-centric foundation for understanding what preference learning actually recovers.
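
As a minimal sketch of the two objects the abstract contrasts, the snippet below computes a BT preference probability from a score difference and estimates a CPRD by counting wins in triplet data. It assumes the logistic form of BT and assumes that the CPRD of an ordered pair is the fraction of comparisons of that pair, in that context, won by the first response. The function names `bt_probability` and `empirical_cprd` and the toy data are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def bt_probability(score_a, score_b):
    """BT probability that a beats b: logistic function of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def empirical_cprd(triplets):
    """Estimate the CPRD from (x, y_pos, y_neg) triplets by counting wins.

    Assumed definition: wins of y_pos over y_neg in context x, divided by
    the total number of comparisons between the two responses in that context.
    """
    counts = Counter(triplets)
    cprd = {}
    for (x, y_pos, y_neg), wins in counts.items():
        losses = counts.get((x, y_neg, y_pos), 0)
        total = wins + losses
        cprd[(x, y_pos, y_neg)] = wins / total
        cprd[(x, y_neg, y_pos)] = losses / total
    return cprd


# Hypothetical toy data: in context "q", response "a" beats "b" in 3 of 4 comparisons.
triplets = [("q", "a", "b")] * 3 + [("q", "b", "a")]
cprd = empirical_cprd(triplets)
```

With a single pair, a BT model can always match the estimate: a score gap of `math.log(3)` gives `bt_probability(math.log(3), 0.0)` approximately equal to `cprd[("q", "a", "b")]`. With three or more responses per context, the pairwise estimates need not be consistent with any single score function, which is the situation the representability conditions address.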

Paper Structure (29 sections, 19 theorems, 121 equations, 4 figures)

Introduction
Contributions.
Related Work
Bradley--Terry and representability.
Statistical analysis.
Preference learning for large language models.
Setup
When does a CPRD admit a Bradley--Terry model?
Learning CPRD
Learning Score Functions from Pairwise Comparisons
Summary.
Experiments
Setup.
Larger Margin Leads to Higher Accuracy
Rank normalization.
...and 14 more sections

Key Result

proposition 4.1

For a distribution $P$ over $\mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$, the CPRD $\omega_P$ is representable by a BT model if and only if there exists a strictly positive function $h:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ such that for all $x$ and all $y\neq y'$ with $P(x,y,y')+P(x,y',y

Figures (4)

Figure 1: Effect of margin on learning efficiency. Learning from the rank-normalized score $r^*_{\text{rank}}$ (which has larger minimum margin) achieves higher accuracy than learning from $r^*$, especially when the number of triplets is small. The gap diminishes as sample size increases.
Figure 2: Effect of comparison graph connectivity on learning. High values of the connectivity degree are associated with high accuracy, though the relationship is not strictly monotonic.
Figure 3: Effect of connectivity optimization on learning. Optimizing $p^-$ to increase connectivity can improve accuracy when the task is hard, i.e., when $\beta$ is large.
Figure 4: Effect of score margin on learning efficiency on pairs with the smallest true margin.
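
Figures 2 and 3 concern the connectivity of the comparison graph. This summary does not define the connectivity measure the paper uses, so the sketch below shows only a coarse proxy: it builds an undirected graph whose nodes are (context, response) pairs and whose edges join responses that were compared at least once, then finds its connected components by breadth-first search. The function name `comparison_components` and the toy data are hypothetical. The check is still informative because BT scores are determined only up to an additive shift within each component, so responses in different components cannot be ranked against each other from the data.

```python
from collections import defaultdict, deque


def comparison_components(triplets):
    """Return the connected components of the comparison graph.

    Nodes are (context, response) pairs; an undirected edge joins two nodes
    whenever the two responses were compared at least once in that context.
    """
    adjacency = defaultdict(set)
    for x, y_pos, y_neg in triplets:
        adjacency[(x, y_pos)].add((x, y_neg))
        adjacency[(x, y_neg)].add((x, y_pos))

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical toy data: "a", "b", "c" are linked by comparisons; "d" and "e" form a separate island.
toy_triplets = [("q", "a", "b"), ("q", "b", "c"), ("q", "d", "e")]
components = comparison_components(toy_triplets)
component_sizes = sorted(len(c) for c in components)
```

A graded measure, such as a spectral quantity of the graph Laplacian, would distinguish weakly from strongly connected graphs; the component count only detects the extreme case where parts of the data are not comparable at all.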

Theorems & Definitions (38)

definition 3.1: Conditional Preference Distribution (CPRD)
definition 3.2: Bradley--Terry Model
definition 4.1
proposition 4.1: CPRD BT factorization
definition 4.2: Positive--negative conditional independence
theorem 4.3: CPRD of conditionally independent distribution and BT model
definition 5.1: Comparison distribution
theorem 5.2: Decomposition of the discriminative BT learning objective
corollary 5.2
corollary 5.2: Consistency under conditional independence
...and 28 more
