On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Jiancong Xiao; Ziniu Li; Xingyu Xie; Emily Getzen; Cong Fang; Qi Long; Weijie J. Su

TL;DR

This work identifies an algorithmic bias in standard RLHF that arises from its KL-based regularization. The bias can suppress minority preferences and, in extreme cases, cause preference collapse. The authors derive a preference matching (PM) RLHF framework whose PM regularizer provably aligns the policy with the reward model's preference distribution under PL/BTL assumptions, balancing reward maximization against response diversity. To handle practical text-generation issues, they introduce a conditional PM RLHF variant that restricts consideration to meaningful outputs while preserving the PM property. Experiments on OPT and Llama-family models show a 29% to 41% improvement in the PM divergence metric compared to KL RLHF, indicating closer agreement with the preference distribution the model is aligned to.

Abstract

Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that the predominant approach for aligning LLMs with human preferences through a reward model, reinforcement learning from human feedback (RLHF), suffers from an inherent algorithmic bias due to its Kullback–Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley–Terry–Luce/Plackett–Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT and Llama-family models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.
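
The contrast between the two regularizers can be checked numerically on a toy problem. The sketch below is not the authors' implementation. It assumes the standard closed-form optimum of the KL-regularized objective, $\pi(y) \propto \pi_{\textnormal{ref}}(y)\exp(r(y)/\beta)$, and compares it with the softmax of the rewards, which is the Plackett-Luce preference distribution and which maximizes expected reward plus the expectation of $-\log\pi(y)$ over a finite response set. The rewards, the reference policy, and all function names are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def kl_rlhf_policy(rewards, pi_ref, beta):
    """Maximizer of E[r] - beta * KL(pi || pi_ref): pi_ref * exp(r / beta), normalized."""
    return softmax(np.log(pi_ref) + np.asarray(rewards, dtype=float) / beta)


def pm_objective(pi, rewards):
    """Expected reward plus the expectation of the -log(pi) regularizer."""
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(pi * (np.asarray(rewards, dtype=float) - np.log(pi))))


# Hypothetical toy problem: three candidate responses to one prompt.
rewards = np.array([1.0, 0.0, -1.0])
pi_ref = np.array([0.2, 0.3, 0.5])

# Plackett-Luce preference distribution implied by the rewards.
preference = softmax(rewards)

print("reward-model preference:", preference.round(3))
for beta in (1.0, 0.5, 0.1):
    print(f"KL RLHF, beta={beta}:", kl_rlhf_policy(rewards, pi_ref, beta).round(3))

# The softmax policy attains the log-sum-exp of the rewards ...
print("PM objective at preference:", pm_objective(preference, rewards))
# ... whereas a policy collapsed (almost) onto the top response scores lower.
near_collapse = np.array([0.998, 0.001, 0.001])
print("PM objective near collapse:", pm_objective(near_collapse, rewards))
```

In this toy setting the KL-regularized policy matches the reward-model preference only when the reference policy is uniform and $\beta = 1$. A non-uniform reference skews it, and a small $\beta$ concentrates nearly all mass on the top response.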

Paper Structure (54 sections, 7 theorems, 69 equations, 8 figures, 5 tables)

Introduction
Related Work
Perspective on Bias in Preference Modeling.
Diversity in Human Preferences.
Preliminaries
Reinforcement Learning from Human Feedback
Step 1: Supervised Fine-Tuning (SFT).
Step 2: Preference and Reward Modeling.
Step 3: Policy Learning and RLHF Fine-tuning.
Preference Matching RLHF
Preference Matching Differential Equation
Preference Matching Regularization
A Dual Perspective.
Extension to Response-Dependent Regularization
Conditional Preference Matching RLHF
...and 39 more sections

Key Result

Proposition 3.3

Let $\phi^\star$ be an optimal solution to the unregularized reward maximization problem $\max_{\phi}\mathbb{E}_{y\sim\pi_\phi(\cdot|x)} r(x,y)$, where the expectation is over the randomness of both $x$ and $y$ following the conditional distribution $\pi_\phi(\cdot|x)$. For a fixed $x$, with probabi

Figures (8)

Figure 1: Output probabilities of the reference model and reward model when fine-tuning the Llama-2-7B model. Panels (a), (b), and (c) demonstrate the aggregate PM divergence for different $\alpha$ and $\beta$ values. Panels (d), (e), and (f) demonstrate the histogram and CDF of the distribution of instance PM divergence in different settings.
Figure 2: Algorithmic bias when $\beta\neq 1$. The smaller $\beta$ is, the more severe the algorithmic bias.
Figure 3: Output probabilities for the reference and reward models in the experiment on OPT-1.3B. Panel (a) shows the output probabilities of the reference model, while panel (b) shows the preferences on the same $(y_1,y_2)$ pairs derived from the reward model.
Figure 4: Algorithmic bias of the KL RLHF--aligned model for various $p_\textnormal{ref}(y)$ and $\beta = 1$. Here, "preference on y" refers to the conditional probability $p(y_1|y_1,y_2,x)$. Algorithmic bias occurs when $p_\textnormal{ref}(y)\neq 0.5$. If $p_\textnormal{ref}(y) < 0.5$, the KL RLHF--aligned model underestimates the probability of $y_1$, and vice versa. The red line, where the aligned model and the reward model have the same preference on $y$, represents the desired alignment.
Figure 5: Output probabilities for the reference and reward models in the experiments conducted on Llama-2-7B. Panel (a) shows the output probabilities of the reference model, while panel (b) shows the preferences on the same $(y_1,y_2)$ pairs derived from the reward model.
...and 3 more figures
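
The direction of the bias described in the captions of Figures 2 and 4 can be reproduced with a short two-response calculation. This is a sketch, not the paper's derivation. It assumes the standard closed-form optimum of the KL-regularized objective, $\pi(y|x) \propto p_\textnormal{ref}(y)\exp(r(x,y)/\beta)$, and only two candidate responses $y_1, y_2$. The shorthands $p$ and $q$ are introduced here for illustration: $p = e^{r(x,y_1)}/(e^{r(x,y_1)}+e^{r(x,y_2)})$ is the reward model's BTL preference for $y_1$, and $q = p_\textnormal{ref}(y_1)$.

$$
\pi(y_1|x) = \frac{q\,p^{1/\beta}}{q\,p^{1/\beta} + (1-q)\,(1-p)^{1/\beta}},
\qquad
\pi(y_1|x)\Big|_{\beta=1} = \frac{q\,p}{q\,p + (1-q)(1-p)}.
$$

For $\beta = 1$ and $0 < p < 1$, the inequality $\pi(y_1|x) < p$ is equivalent to $q < 1-q$, so the aligned model underestimates $y_1$ exactly when $q < 0.5$ and matches the reward model only at $q = 0.5$. For example, $q = 0.2$ and $p = 0.7$ give $\pi(y_1|x) = 0.14/0.38 \approx 0.368$. With $q = 0.5$ and $\beta < 1$, the exponent $1/\beta > 1$ pushes $\pi(y_1|x)$ away from $p$ toward 0 or 1, and the effect grows as $\beta$ shrinks.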

Theorems & Definitions (13)

Definition 3.1: Preference Matching
Remark 3.2
Proposition 3.3
Theorem 1
Remark 3.4
Definition 3.5
Theorem 2
Proposition 3.6
Theorem 3
Definition 4.1
...and 3 more
