Preference Learning Algorithms Do Not Learn Preference Rankings

Angelica Chen, Sadhika Malladi, Lily H. Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun Cho

TL;DR

This work studies the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via ranking accuracy, and finds that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.

Abstract

Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via ranking accuracy. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant alignment gap -- i.e., a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
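As a rough illustration of the ranking accuracy metric described above, the following minimal sketch counts how often a model assigns a higher log-likelihood to the preferred output than to the dispreferred one. The function name, the tuple layout, and the toy numbers are assumptions made for this example and are not taken from the paper. Length normalization is assumed here to mean dividing each summed log-probability by its token count.

```python
from typing import Sequence, Tuple


def ranking_accuracy(
    pairs: Sequence[Tuple[float, float, int, int]],
    length_normalize: bool = False,
) -> float:
    """Fraction of pairs where the preferred output scores higher.

    Each pair is (logp_w, logp_l, len_w, len_l): summed token
    log-probabilities of the preferred (w) and dispreferred (l)
    completions under the model, plus their token counts.
    This layout is a hypothetical choice for illustration.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = 0
    for logp_w, logp_l, len_w, len_l in pairs:
        if length_normalize:
            # Compare average per-token log-likelihoods instead of totals.
            logp_w, logp_l = logp_w / len_w, logp_l / len_l
        correct += logp_w > logp_l
    return correct / len(pairs)


# Hypothetical log-probabilities, not results from the paper.
toy_pairs = [
    (-12.0, -15.0, 10, 12),  # preferred output is more likely
    (-30.0, -20.0, 30, 10),  # long preferred output has lower total likelihood
    (-8.0, -9.5, 8, 9),      # preferred output is more likely
    (-25.0, -18.0, 20, 15),  # preferred output is less likely
]
print(ranking_accuracy(toy_pairs))
print(ranking_accuracy(toy_pairs, length_normalize=True))
```

The second toy pair shows why the two variants can disagree: a long preferred completion can have a lower total log-likelihood yet a higher per-token average.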

Paper Structure

This paper contains 47 sections, 9 theorems, 36 equations, 12 figures, 6 tables.

Key Result

Proposition 2.6

Recall the definition of $y_w$, $y_l$ in Definition 2.1. If ${\pi_{\text{Ref}}}(y_w\mid x) \geq {\pi_{\text{Ref}}}(y_l\mid x)$ and ${\mathcal{L}}_{\text{DPO}}(x, y_w, y_l; {\pi_\theta}, {\pi_{\text{Ref}}}) \leq 0.6$, then ${{\mathcal{R}}}(x, y_w, y_l) = 1$.
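
A short sketch of why a threshold of this kind works. It assumes that the per-datapoint loss is the standard DPO loss of Rafailov et al. (2023) with $\beta > 0$, and that ${\mathcal{R}}(x, y_w, y_l) = 1$ means $\pi_\theta(y_w\mid x) > \pi_\theta(y_l\mid x)$. The page does not reproduce either definition, so both are assumptions here.

$$
\begin{aligned}
{\mathcal{L}}_{\text{DPO}}(x, y_w, y_l; \pi_\theta, \pi_{\text{Ref}})
  &= -\log\sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w\mid x)}{\pi_\theta(y_l\mid x)} - \log\frac{\pi_{\text{Ref}}(y_w\mid x)}{\pi_{\text{Ref}}(y_l\mid x)}\right]\right) \leq 0.6 < \log 2 \approx 0.693 \\
  &\Longrightarrow\ \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w\mid x)}{\pi_\theta(y_l\mid x)} - \log\frac{\pi_{\text{Ref}}(y_w\mid x)}{\pi_{\text{Ref}}(y_l\mid x)}\right]\right) > \tfrac{1}{2} \\
  &\Longrightarrow\ \log\frac{\pi_\theta(y_w\mid x)}{\pi_\theta(y_l\mid x)} > \log\frac{\pi_{\text{Ref}}(y_w\mid x)}{\pi_{\text{Ref}}(y_l\mid x)} \geq 0 \\
  &\Longrightarrow\ \pi_\theta(y_w\mid x) > \pi_\theta(y_l\mid x).
\end{aligned}
$$

The second line uses the fact that $-\log\sigma(z) < \log 2$ exactly when $z > 0$. The final inequality on the third line uses the hypothesis ${\pi_{\text{Ref}}}(y_w\mid x) \geq {\pi_{\text{Ref}}}(y_l\mid x)$. Under these assumptions, any constant strictly below $\log 2$ would serve in place of $0.6$.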

Figures (12)

  • Figure 1: Both reference and preference-tuned models exhibit low ranking accuracy on most preference datasets. Each point represents the length-normalized or non-length-normalized ranking accuracy of individual (\ref{fig:ranking-acc-ref-models}) reference models (pre-trained or fine-tuned), or (\ref{fig:ranking-acc-pref-models}) preference-tuned models (trained with DPO or RLHF). The random chance accuracy for each dataset is indicated with a black 'X'. We sub-sample 1K examples from each dataset and use the test split when available. We describe datasets in \ref{sec:datasets} and list all numbers in Tables \ref{tab:ra-full-pt1}, \ref{tab:ra-full-pt2}, and \ref{tab:ra-full-pt3}. For UltraFeedback, ranking accuracy is measured with exact match across all 4 outputs (see App. \ref{app:ra-def-n-greater-than-2}).
  • Figure 2: Despite continuously decreasing the loss, DPO rarely flips the rankings of pairs before the point of overfitting (marked by the vertical dashed line) and instead mostly increases the reward margin of already correctly ranked pairs. We train a Pythia-2.8B model for 5 epochs using the DPO objective and categorize the training dataset into four subsets -- examples that initially have the correct ranking and end up (1) correct or (2) incorrect, and examples that initially have the incorrect ranking and end up (3) correct or (4) incorrect. In all three figures, the hue of the point indicates the category. The dashed vertical line indicates the training step at which the lowest evaluation loss occurs. Past this point, the model begins to overfit (i.e., the evaluation loss starts to increase). We also present results for two other models with three seeds each in Appendix \ref{app:dpo-training-dynamics}.
  • Figure 3: DPO loss alone does not predict ranking accuracy, due to the influence of the reference model log-ratio in the loss. Each point represents the DPO loss on a separate training example $(x,y_w,y_l)$ from a subsample of 1K examples from the training dataset, using the model $\pi_{\theta^*}$ that corresponds to the checkpoint with the lowest validation loss. The color of each point indicates whether $\pi_{\theta^*}$ achieves the correct ranking on that example, i.e., whether $\pi_{\theta^*}(y_w|x)>\pi_{\theta^*}(y_l|x)$. The dashed line is the function $f(c)=-\log\sigma(\beta c)$, from \ref{prop:dpo_loss}. In summary, the examples that $\pi_{\theta^*}$ classifies correctly tend to be those that were already classified correctly by the reference model. Results for the other two seeds of each model are given in Fig. \ref{fig:dpo-vs-lsr-vs-flipped-other-seeds}.
  • Figure 4: When the model weights have not travelled far from $\theta_\text{Ref}$, ranking accuracy and win rate increase together. $\theta_t$ represents the model weights at checkpoint $t$ during DPO training, and $\theta_\gamma$ represents the weights for a model trained to convergence with ${\mathcal{L}}_{\text{DPO}}^\gamma$.
  • Figure 5: Average DPO loss over the course of training, for four categories of the training data (Anthropic HH-RLHF; Bai et al., 2022). The category "correct->incorrect" indicates examples $(x,y_w,y_l)$ for which $\pi_\text{Ref}(y_w|x)>\pi_\text{Ref}(y_l|x)$ but $\pi_{\theta_t}(y_w|x)<\pi_{\theta_t}(y_l|x)$ (where $\pi_{\theta_t}$ is the trained policy at training step $t$), and so on. Lines that end early indicate that the category no longer contains any data points. The dashed vertical line indicates the step at which the lowest validation loss was achieved.
  • ...and 7 more figures
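
The captions for Figures 2 and 3 describe a loss that can keep decreasing without correcting a ranking. The sketch below illustrates one way this can happen, assuming the standard per-example DPO loss of Rafailov et al. (2023). Here a "margin" is $\log\pi(y_w\mid x) - \log\pi(y_l\mid x)$, and $c$ in $f(c)=-\log\sigma(\beta c)$ is assumed to be the reference log-ratio $\log\pi_\text{Ref}(y_l\mid x) - \log\pi_\text{Ref}(y_w\mid x)$, which the caption does not define. All function names and numbers are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_margin: float, ref_margin: float, beta: float) -> float:
    """Per-example DPO loss, written in terms of log-probability margins.

    A margin is log pi(y_w|x) - log pi(y_l|x); it is positive when the
    model ranks the preferred output above the dispreferred one.
    """
    return -log_sigmoid(beta * (policy_margin - ref_margin))


def flip_threshold(ref_margin: float, beta: float) -> float:
    """Loss at the ranking boundary, i.e. when policy_margin == 0.

    Equals f(c) = -log(sigmoid(beta * c)) with c = -ref_margin
    (assumed meaning of c). The policy ranks the pair correctly
    only once the loss drops below this value.
    """
    return dpo_loss(0.0, ref_margin, beta)


beta = 0.1         # hypothetical DPO temperature
ref_margin = -5.0  # reference model ranks the pair incorrectly by 5 nats

start = dpo_loss(ref_margin, ref_margin, beta)  # policy == reference: log 2
partway = dpo_loss(-2.0, ref_margin, beta)      # loss decreased, margin still negative
needed = flip_threshold(ref_margin, beta)

print(f"initial loss:           {start:.4f}")
print(f"loss at margin -2:      {partway:.4f}")
print(f"loss needed to flip:    {needed:.4f}")
```

In this toy setting the loss falls from its initial value while the policy margin is still negative, so the pair stays incorrectly ranked until the loss drops below the threshold. The more strongly the reference model prefers the wrong output, the lower that threshold becomes.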

Theorems & Definitions (23)

  • Definition 2.1: Aggregated Preference Datapoint
  • Definition 2.2: DPO Objective (Rafailov et al., 2023)
  • Definition 2.3: Ranking Accuracy
  • Remark 2.4: Lengths of Completions
  • Remark 2.5: Difference between Ranking Accuracy and Reward Accuracy
  • Proposition 2.6: Sanity Check
  • Theorem 3.1: Simulating Perfect RLHF
  • Remark 3.2
  • Corollary 3.3: Idealized Ranking Accuracy
  • Theorem 4.1
  • ...and 13 more