Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment

Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu

TL;DR

The paper argues that Bradley-Terry (BT) reward models are not expressive enough to capture complex human preferences in language model alignment, particularly intransitive ones. It introduces the General Preference embedding Model (GPM), which embeds responses in a latent space and compares them through a skew-symmetric operator, capturing such preferences while keeping query complexity linear. It then proposes General Preference Optimization (GPO), which uses the resulting preference scores for policy optimization, with convergence guarantees and compatibility with existing RLHF methods. Empirically, GPM outperforms BT on RewardBench, models cyclic preferences effectively, and improves results on downstream benchmarks such as AlpacaEval 2.0, MT-Bench, GSM8K, and MMLU, suggesting more reliable alignment with nuanced human values.

Abstract

Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce preference embedding, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0, following the language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.
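As a rough illustration of the preference-embedding idea, the sketch below scores every pair of K responses from K embedding vectors. It assumes the score is the bilinear form $\mathbf{v}_i^\top \mathbf{R} \mathbf{v}_j$ with a block-diagonal skew-symmetric $\mathbf{R}$ built from 2×2 blocks, and that probabilities come from a sigmoid; the function names, the random stand-in embeddings, and the sign convention are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def skew_operator(num_blocks: int) -> np.ndarray:
    """Block-diagonal skew-symmetric matrix made of 2x2 blocks [[0, -1], [1, 0]]."""
    block = np.array([[0.0, -1.0], [1.0, 0.0]])
    return np.kron(np.eye(num_blocks), block)

def preference_scores(embeddings: np.ndarray, operator: np.ndarray) -> np.ndarray:
    """Pairwise scores S[i, j] = v_i^T R v_j (assumed convention) for all pairs."""
    return embeddings @ operator @ embeddings.T

def preference_probs(scores: np.ndarray) -> np.ndarray:
    """Map scores to probabilities with a sigmoid (assumed link function)."""
    return 1.0 / (1.0 + np.exp(-scores))

# Hypothetical setup: K responses, each embedded once (K "model queries").
rng = np.random.default_rng(0)
K, dim = 4, 6
V = rng.normal(size=(K, dim))  # random stand-in for learned embeddings
R = skew_operator(dim // 2)

S = preference_scores(V, R)    # K x K score matrix from only K embeddings
P = preference_probs(S)

print(np.allclose(S, -S.T))      # scores are antisymmetric
print(np.allclose(P + P.T, 1.0)) # P(i beats j) + P(j beats i) = 1
```

Because $\mathbf{R}$ is skew-symmetric, the score of a pair flips sign when the pair is swapped, so the two probabilities for any pair sum to one and each response ties with itself. Only K embeddings are needed to fill the K×K matrix, which is one way to read the linear query complexity claim.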

Paper Structure (33 sections, 5 theorems, 76 equations, 3 figures, 9 tables)

Key Result

Theorem 4.4

Let $\mathbf{P} \in \mathbb{R}^{k \times k}$ be a real skew-symmetric matrix (i.e., $\mathbf{P} = -\mathbf{P}^\top$). Then there exist vectors $\{\mathbf{v}_i\}_{i=1}^{k} \subset \mathbb{R}^{2k}$ and a block-diagonal skew-symmetric matrix $\mathbf{R}^\succ \in \mathbb{R}^{2k \times 2k}$, with […] such that: […]
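A minimal numerical sketch of the kind of representation this theorem describes. It assumes, since this is not spelled out above, that the conclusion has the form $P_{ij} = \mathbf{v}_i^\top \mathbf{R}^\succ \mathbf{v}_j$ and that $\mathbf{R}^\succ$ consists of 2×2 blocks $\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$. The construction below is one simple choice that satisfies that form; it is not necessarily the one used in the paper's proof, and `embed_skew_matrix` is a hypothetical helper name.

```python
import numpy as np

def embed_skew_matrix(P):
    """Return (V, R) with V of shape (k, 2k) such that V @ R @ V.T == P.

    Each 2-D block of v_i holds a pair (x, y). With the block [[0, -1], [1, 0]],
    one block contributes y_i * x_j - x_i * y_j to the score, so choosing
    X = I and Y = P / 2 gives P/2 - (P/2)^T = P for skew-symmetric P.
    """
    P = np.asarray(P, dtype=float)
    if not np.allclose(P, -P.T):
        raise ValueError("P must be skew-symmetric")
    k = P.shape[0]
    V = np.zeros((k, 2 * k))
    V[:, 0::2] = np.eye(k)   # x-coordinates of each 2-D block
    V[:, 1::2] = P / 2.0     # y-coordinates of each 2-D block
    R = np.kron(np.eye(k), np.array([[0.0, -1.0], [1.0, 0.0]]))
    return V, R

# Random skew-symmetric test matrix (illustrative data only).
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
P = A - A.T

V, R = embed_skew_matrix(P)
reconstructed = V @ R @ V.T
print(np.allclose(reconstructed, P))
```

Under these assumptions, any skew-symmetric matrix of pairwise scores over k items can be reproduced exactly by 2k-dimensional embeddings, which is the sense in which the embedding model is fully expressive.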

Figures (3)

  • Figure 1: Illustration of (a) Bradley-Terry (BT) reward model, (b) supervised pair preference model (PairRM, PairPM) (Jiang et al., 2023; Dong et al., 2024), and (c) our General Preference embedding Model (GPM).
  • Figure 2: Intransitiveness in real-world preferences. Left: Food preferences might cycle (Apple $\succ$ Banana $\succ$ Cherry $\succ$ Apple). Right: Rock-Paper-Scissors is a classic intransitive game.
  • Figure 3: Visualization of learned preference embedding vectors for cyclic preferences with sizes 3, 4, and 5, e.g., $A \succ B \succ C \succ A$.
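The cyclic case in Figures 2 and 3 can be made concrete with a small sketch. It assumes a score of the form $\mathbf{v}_i^\top \mathbf{R} \mathbf{v}_j$ with $\mathbf{R} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$ and hand-picked 2-D embeddings spaced 120° apart; these vectors are chosen for illustration and are not the learned embeddings shown in Figure 3. For comparison, a scalar-reward (BT-style) model is probed by random search.

```python
import numpy as np

# Cycle A > B > C > A, written as (winner, loser) index pairs.
cycle = [(0, 1), (1, 2), (2, 0)]

# Hand-picked unit vectors 120 degrees apart (illustrative, not learned).
angles = -2.0 * np.pi * np.arange(3) / 3.0
V = np.stack([np.cos(angles), np.sin(angles)], axis=1)
R = np.array([[0.0, -1.0], [1.0, 0.0]])

# With this R, scores[i, j] = sin(angle_i - angle_j).
scores = V @ R @ V.T
embedding_correct = sum(scores[w, l] > 0 for w, l in cycle)

# Scalar rewards: the differences r_w - r_l around the cycle sum to zero,
# so they cannot all be positive. Random search confirms the ceiling.
rng = np.random.default_rng(0)
best_bt_correct = 0
for _ in range(10_000):
    r = rng.normal(size=3)
    correct = sum(r[w] > r[l] for w, l in cycle)
    best_bt_correct = max(best_bt_correct, correct)

print(embedding_correct)  # all 3 cyclic pairs ordered correctly
print(best_bt_correct)    # a scalar reward gets at most 2 of 3
```

The scalar model fails for a structural reason: any assignment of one reward per item induces a total order, and a total order must contradict at least one edge of a cycle. The rotation-like operator avoids this because "beats" depends on relative angle rather than on a single magnitude.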

Theorems & Definitions (9)

  • Definition 4.1: Preference Embeddings
  • Definition 4.2: Skew-symmetric Preference Operator
  • Definition 4.3: Preference Score
  • Theorem 4.4: Expressiveness of Preference Embedding Model
  • Theorem 4.5
  • Theorem 5.1
  • Remark 5.2
  • Proposition 1.1
  • Theorem 1.2: Expressiveness of Complex Preference Embeddings