Preference Learning Algorithms Do Not Learn Preference Rankings

Angelica Chen; Sadhika Malladi; Lily H. Zhang; Xinyi Chen; Qiuyi Zhang; Rajesh Ranganath; Kyunghyun Cho

TL;DR

This work studies the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via ranking accuracy, and finds that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.

Abstract

Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via ranking accuracy. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant alignment gap -- i.e., a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
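The ranking accuracy described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function name `ranking_accuracy`, its arguments, and the choice to count ties as incorrect are all assumptions made here. It takes summed log-likelihoods of the preferred and dispreferred completions and, optionally, their token lengths for the length-normalized variant mentioned in the Figure 1 caption.

```python
from typing import Optional, Sequence


def ranking_accuracy(
    chosen_logps: Sequence[float],
    rejected_logps: Sequence[float],
    chosen_lens: Optional[Sequence[int]] = None,
    rejected_lens: Optional[Sequence[int]] = None,
) -> float:
    """Fraction of pairs where the preferred output gets the higher likelihood.

    Inputs are summed log-probabilities log pi(y | x) per completion.
    If both length arguments are given, each log-probability is divided by
    its token count first (length-normalized variant).
    Ties count as incorrect here; that tie rule is an assumption.
    """
    if len(chosen_logps) != len(rejected_logps):
        raise ValueError("need one rejected log-prob per chosen log-prob")
    if len(chosen_logps) == 0:
        raise ValueError("need at least one preference pair")

    normalize = chosen_lens is not None and rejected_lens is not None
    correct = 0
    for i, (lp_w, lp_l) in enumerate(zip(chosen_logps, rejected_logps)):
        if normalize:
            lp_w = lp_w / chosen_lens[i]
            lp_l = lp_l / rejected_lens[i]
        if lp_w > lp_l:
            correct += 1
    return correct / len(chosen_logps)
```

Length normalization can change the verdict on a pair: a long preferred completion with summed log-probability -10 over 10 tokens loses to a short dispreferred one with -4 over 2 tokens when unnormalized, but wins once both are divided by their lengths.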

Paper Structure (47 sections, 9 theorems, 36 equations, 12 figures, 6 tables)

Introduction
Preliminaries
Learning from Human Preferences
Preference Data
Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
Preference Learning with DPO
Evaluation Metrics
The Alignment Gap
Existing Reference Models Rarely Have Correct Rankings
Idealized Ranking Accuracy
Measuring the Alignment Gap
Understanding Ranking Accuracy with DPO
DPO Rarely Flips Preference Rankings
Analysis: How Easy Is It To Flip A Ranking?
...and 32 more sections

Key Result

Proposition 2.6

Recall the definition of $y_w$, $y_l$ in Definition 2.1. If $\pi_{\text{Ref}}(y_w\mid x) \geq \pi_{\text{Ref}}(y_l\mid x)$ and $\mathcal{L}_{\text{DPO}}(x, y_w, y_l; \pi_\theta, \pi_{\text{Ref}}) \leq 0.6$, then $\mathcal{R}(x, y_w, y_l) = 1$.
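A short sketch of why this sanity check holds may help. It assumes the standard per-example DPO loss of Rafailov et al. (Definition 2.2 is not reproduced in this excerpt) and $\beta > 0$; the argument uses only that $0.6 < \log 2 \approx 0.693$.

```latex
\begin{align*}
\mathcal{L}_{\text{DPO}}(x, y_w, y_l; \pi_\theta, \pi_{\text{Ref}})
  &= -\log \sigma\!\left(
       \beta \left[
         \log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}
         - \log \frac{\pi_{\text{Ref}}(y_w \mid x)}{\pi_{\text{Ref}}(y_l \mid x)}
       \right]
     \right)
  \leq 0.6 < \log 2 \\
\Longrightarrow\quad
  & \sigma\!\left(
       \beta \left[
         \log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}
         - \log \frac{\pi_{\text{Ref}}(y_w \mid x)}{\pi_{\text{Ref}}(y_l \mid x)}
       \right]
     \right) > \tfrac{1}{2} \\
\Longrightarrow\quad
  & \log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}
    > \log \frac{\pi_{\text{Ref}}(y_w \mid x)}{\pi_{\text{Ref}}(y_l \mid x)}
    \geq 0 \\
\Longrightarrow\quad
  & \pi_\theta(y_w \mid x) > \pi_\theta(y_l \mid x),
  \quad\text{so } \mathcal{R}(x, y_w, y_l) = 1 .
\end{align*}
```

The last inequality on the third line is where the hypothesis $\pi_{\text{Ref}}(y_w\mid x) \geq \pi_{\text{Ref}}(y_l\mid x)$ enters. Without it, a loss below $\log 2$ only says the policy's log-ratio exceeds the reference's, which can still be negative.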

Figures (12)

Figure 1: Both reference and preference-tuned models exhibit low ranking accuracy on most preference datasets. Each point represents the length-normalized or non-length-normalized ranking accuracy of individual (\ref{fig:ranking-acc-ref-models}) reference models (pre-trained or fine-tuned), or (\ref{fig:ranking-acc-pref-models}) preference-tuned models (trained with DPO or RLHF). The random chance accuracy for each dataset is indicated with a black 'X'. We sub-sample 1K examples from each dataset and use the test split when available. We describe datasets in \ref{sec:datasets} and list all numbers in Tables \ref{tab:ra-full-pt1}, \ref{tab:ra-full-pt2}, and \ref{tab:ra-full-pt3}. For UltraFeedback, ranking accuracy is measured with exact match across all 4 outputs (see App. \ref{app:ra-def-n-greater-than-2}).
Figure 2: Despite continuously decreasing the loss, DPO rarely flips the rankings of pairs before the point of overfitting (marked by the vertical dashed line) and instead mostly increases the reward margin of already correctly ranked pairs. We train a Pythia-2.8B model for 5 epochs using the DPO objective and categorize the training dataset into four subsets -- examples that initially have the correct ranking and are flipped to (1) correct or (2) incorrect, and examples that initially have the incorrect ranking and are flipped to (3) correct or (4) incorrect. In all three figures, the hue of the point indicates the category. The dashed vertical line indicates the training step at which the lowest eval. loss occurs. Past this point, the model begins to overfit (i.e., the eval. loss starts to increase). We also present results for two other models with three seeds each in Appendix \ref{app:dpo-training-dynamics}.
Figure 3: DPO loss alone does not predict ranking accuracy, due to the influence of the reference model log-ratio in the loss. Each point represents the DPO loss on a separate training example $(x,y_w,y_l)$ from a subsample of 1K examples from the training dataset, using the model $\pi_{\theta^*}$ that corresponds to the checkpoint with the lowest validation loss. The color of each point indicates whether $\pi_{\theta^*}$ achieves the correct ranking on that example, i.e., whether $\pi_{\theta^*}(y_w|x)>\pi_{\theta^*}(y_l|x)$. The dashed line is the function $f(c)=-\log\sigma(\beta c)$, from \ref{prop:dpo_loss}. In summary, the examples that $\pi_{\theta^*}$ classifies correctly tend to be those that were already classified correctly by the reference model. Results for the other two seeds of each model are given in Fig. \ref{fig:dpo-vs-lsr-vs-flipped-other-seeds}.
Figure 4: When the model weights have not travelled far from $\theta_\text{Ref}$, ranking accuracy and win rate increase together. $\theta_t$ represents the model weights at checkpoint $t$ during DPO training, and $\theta_\gamma$ represents the weights for a model trained to convergence with $\mathcal{L}_{\text{DPO}}^\gamma$.
Figure 5: Average DPO loss over the course of training, for four categories of the training data (Anthropic HH-RLHF; bai2022training). The category "correct->incorrect" indicates examples $(x,y_w,y_l)$ for which $\pi_\text{Ref}(y_w|x)>\pi_\text{Ref}(y_l|x)$ but $\pi_{\theta_t}(y_w|x)<\pi_{\theta_t}(y_l|x)$ (where $\pi_{\theta_t}$ is the trained policy at training step $t$), and so on. Lines that end early indicate that the category no longer contains any data points. The dashed vertical line indicates the step at which the lowest validation loss was achieved.
...and 7 more figures
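The four categories used in the Figure 2 and Figure 5 captions, and the observation that the DPO loss can fall without a ranking flip, can be illustrated with a small sketch. It assumes the standard per-example DPO loss of Rafailov et al.; the helper names `dpo_loss` and `flip_category`, the default `beta=0.1`, and the numbers in the usage note are hypothetical and not taken from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_w: float, policy_l: float,
             ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """Per-example DPO loss from summed log-probs (standard form, assumed).

    policy_w / policy_l: log pi_theta(y_w | x), log pi_theta(y_l | x)
    ref_w / ref_l:       log pi_Ref(y_w | x),   log pi_Ref(y_l | x)
    beta: hypothetical default; the excerpt does not state the value used.
    """
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -log_sigmoid(margin)


def flip_category(policy_w: float, policy_l: float,
                  ref_w: float, ref_l: float) -> str:
    """Label an example as '<reference ranking>-><policy ranking>'.

    A ranking is 'correct' when the preferred output has the strictly
    higher log-probability, matching the strict inequalities in Figure 5.
    """
    before = "correct" if ref_w > ref_l else "incorrect"
    after = "correct" if policy_w > policy_l else "incorrect"
    return f"{before}->{after}"
```

For a hypothetical example with reference log-probs `ref_w=-10, ref_l=-5` and policy log-probs `policy_w=-8, policy_l=-6`, the loss drops below its starting value of `log 2` (the value when the policy equals the reference), yet `flip_category` still returns `"incorrect->incorrect"`. The policy improved its margin relative to the reference without overcoming the reference model's initial ranking error.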

Theorems & Definitions (23)

Definition 2.1: Aggregated Preference Datapoint
Definition 2.2: DPO Objective (Rafailov et al., 2023)
Definition 2.3: Ranking Accuracy
Remark 2.4: Lengths of Completions
Remark 2.5: Difference between Ranking Accuracy and Reward Accuracy
Proposition 2.6: Sanity Check
Theorem 3.1: Simulating Perfect RLHF
Remark 3.2
Corollary 3.3: Idealized Ranking Accuracy
Theorem 4.1
...and 13 more
