Mitigating Preference Hacking in Policy Optimization with Pessimism

Dhawal Gupta, Adam Fisch, Christoph Dann, Alekh Agarwal

TL;DR

The paper tackles reward/preference hacking in RLHF, which arises because reward and preference models are trained on limited preference data, by introducing pessimistic, uncertainty-aware objectives for policy optimization. It develops a restricted pessimistic Nash framework and presents two practical algorithms for optimizing these objectives: P3O (general preferences) and PRPO (reward-based). The approach comes with theoretical guarantees under data-coverage assumptions and uses tractable approximations for the log-partition function and the uncertainty sets. Empirically, on language model summarization and helpfulness tasks, P3O and PRPO are compared against REINFORCE, DPO, and Nash-EMA, and show reduced overoptimization along with shorter, less list-heavy outputs. The resulting policies rely on pessimism with data-aware constraints, rather than strong KL regularization, to remain robust to imperfect human feedback.

Abstract

This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on \emph{fixed preference datasets}, and these models are unreliable when evaluated outside the support of this preference data, leading to the common reward or preference hacking phenomenon. We propose novel, pessimistic objectives for RLHF which are provably robust to overoptimization through the use of pessimism in the face of uncertainty, and design practical algorithms, P3O and PRPO, to optimize these objectives. Our approach is derived for the general preference optimization setting, but can be used with reward models as well. We evaluate P3O and PRPO on the tasks of fine-tuning language models for document summarization and creating helpful assistants, demonstrating remarkable resilience to overoptimization.
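
As a rough orientation, the pessimistic game referred to above can be written down in the max/min/min form named in the caption of Figure 2. The following is only a sketch under stated assumptions: the notation $p(y \succ y')$ for the preference probability is introduced here for illustration, prompts are omitted, any regularization terms used by the paper are left out, and restricting only the opponent to the covered set $\Pi(\pi_{\mathrm{data}}, C)$ is an assumption based on the Figure 2 description rather than the paper's exact equation.

```latex
% Unrestricted pessimistic game (sketch): the opponent \pi' ranges over all policies.
\pi_{\mathrm{p}\textrm{-}\mathrm{nash}} \in \mathop{\rm arg\,max}_{\pi} \; \min_{\pi'} \; \min_{p \in \mathcal{P}} \;
  \mathbb{E}_{y \sim \pi,\, y' \sim \pi'} \big[ p(y \succ y') \big]

% Restricted variant (sketch): the opponent is limited to data-covered policies.
\pi_{\mathrm{rp}\textrm{-}\mathrm{nash}} \in \mathop{\rm arg\,max}_{\pi} \; \min_{\pi' \in \Pi(\pi_{\mathrm{data}}, C)} \; \min_{p \in \mathcal{P}} \;
  \mathbb{E}_{y \sim \pi,\, y' \sim \pi'} \big[ p(y \succ y') \big]
```

Here ${\mathcal{P}}$ plays the role of the set of preference functions consistent with the data, so the inner minimum over $p$ is what makes the objective pessimistic.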

Paper Structure

This paper contains 38 sections, 4 theorems, 46 equations, 14 figures, 4 tables, 2 algorithms.

Key Result

Lemma 3.2

We denote the restricted pessimistic Nash policy by $\pi_{\mathrm{rp}\textrm{-}\mathrm{nash}}$ from Eq. \ref{eq:restricted-pess-nash}, and let $p^\star$ be the ground-truth preference function underlying ${\mathcal{D}}$. Then we have that for any $\pi\in\Pi(\pi_{\mathrm{data}}, C)$ with $C \geq 1$: where $\varepsilon$ is a bound on how much preference functions in ${\mathcal{P}}$ can disagree in total vari

Figures (14)

  • Figure 1: Comparison of our methods (P3O and PRPO) against standard approaches (REINFORCE and DPO) on summarization and "helpful assistant" tasks, showing the evaluation preference of a prompted Gemini evaluator for generations of the policy over those of the reference policy. The hyperparameters of each method have been tuned to prevent reward hacking (best eval performance), which necessitates strong KL-regularization for DPO and REINFORCE. Our approaches, however, can avoid reward hacking by relying on pessimism instead of KL regularization (a blunt tool), and achieve consistently better performance as a result. We also compare to Nash-EMA \citep{munos2023nash}, a natural baseline that also employs a preference model instead of a reward model, but without any pessimism. Like REINFORCE, the best Nash-EMA values still plateau at lower performance on summarization. While it achieves similar win-rates over $\pi_{\mathrm{ref}}$ on helpfulness, it does so at the cost of much longer and more idiosyncratic generations on average (see Figure \ref{fig:qualitative-results}). Shaded areas show $95\%$ CIs of the evaluation.
  • Figure 2: An illustration of the problematic example for pessimistic preference optimization with unrestricted opponents. We assume that $\{y_1, y_2\}$ are well-sampled in the preference data, whereas $y_3$ is not---resulting in certain preferences for $y_1$ vs. $y_2$, but completely uncertain preferences for $y_3$ vs. others (Left). The $3\times3$ matrices above are then the optimized pessimistic preference matrices for $\{y_1, y_2, y_3\}$, and the 3-dimensional vectors are the optimized competing policies. Specifically, shaded entries represent optimizable variables, and the values in the blue and red shaded entries are the solutions for the max player ($\mathop{\rm arg\,max}_\pi$) and the min player ($\mathop{\rm arg\,min}_{\pi'}$ and $\mathop{\rm arg\,min}_{p \in {\mathcal{P}}}$), respectively; see Eq. \ref{eq:max-min-min-objective}. Middle: when the opponent $\pi'$ is unrestricted, the optimal policy $\pi_{\mathrm{p}\textrm{-}\mathrm{nash}}$ must hedge and put significant support on $y_3$. Right: restricting the support of $\pi'$ to the support of the preference data (i.e., $\{y_1, y_2\}$) avoids this issue and yields a more reasonable optimal policy $\pi_{\mathrm{rp}\textrm{-}\mathrm{nash}}$.
  • Figure 3: Tabular experiments: Comparison of the different objectives with an explicit search over the main policy, opponent policy, and version space. The $X$-axis shows the probability assigned to the under-sampled output ($y_3$), from $0.0 \rightarrow 0.2$. The $Y$-axis indicates the minimum preference of the policy found over all covered actions (higher is better). We also show the ground-truth preference matrix $p^\star$ (top-right) and the corresponding $\pi_{\mathrm{nash}}$ (bottom-right). Results are averaged over 10 random seeds; shaded areas represent $\pm 2 \times$ standard error. EP3O(0.1), corresponding to the restricted Nash formulation, consistently does well, particularly when the sampling rate of $y_3$ is very low (left part of plot).
  • Figure 4: Qualitative results for the helpfulness and summarization tasks: On helpfulness, both response length and list formats are common reward hacks \citep{eisenstein2024helping}. While policies do generate longer responses than the responses in $\pi_{\mathrm{data}}$, both REINFORCE and Nash-EMA converge on generations that are $\approx 40$ to $50\%$ longer than those of P3O and PRPO (left). Similarly, REINFORCE degenerates into producing responses that are nearly all formatted as lists (with Nash-EMA at over $50\%$), while P3O and PRPO stay closer to $\pi_{\mathrm{data}}$ (middle). On summarization, DPO, REINFORCE, and Nash-EMA all show clear signs of length hacking---also a pervasive issue on this task \citep{eisenstein2024helping, singhal2023long, park2024disentangling}. In contrast, both PRPO and P3O converge to the average length of preferred responses in $\pi_{\mathrm{data}}$, all while also achieving the highest win-rates (right).
  • Figure 5: Confusion matrix showing the pairwise evaluation across the different methods, for each method's best checkpoint selected in terms of preferences over the reference policy. In head-to-head comparisons, P3O and PRPO outperform baselines, except Nash-EMA on helpfulness, where the Gemini 1.5 Flash evaluation preference aligns with the preference model used for training---favoring the verbose, list-heavy outputs of Nash-EMA (and REINFORCE, to a lesser degree). However, when looking for both helpful and concise responses, P3O and PRPO win rates increase dramatically against Nash-EMA (see Figure \ref{fig:confusion-concise} in the appendix), demonstrating their robustness.
  • ...and 9 more figures
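
The three-output example described in the Figure 2 caption can be reproduced in a few lines. The sketch below is an illustration under simplifying assumptions, not the paper's implementation: the uncertainty set is taken to be a simple box (every preference involving $y_3$ may lie anywhere in $[0, 1]$), the known preference of $y_1$ over $y_2$ is set to a hypothetical value of 0.6, no KL or other regularization is included, and the max player is found by grid search over the simplex. The function names (`preference_matrix`, `pessimistic_value`, `solve`) are made up for this example. Because the objective is bilinear in the opponent policy and the preference matrix, the inner minimum is attained at a corner of the box and a pure opponent action, so it can be computed exactly by enumeration.

```python
import itertools
import numpy as np

def preference_matrix(a, b, p12=0.6):
    """P[i, j] = probability that output i is preferred over output j.

    a = P(y1 > y3) and b = P(y2 > y3) are the uncertain entries;
    p12 = P(y1 > y2) is a hypothetical, well-estimated value.
    """
    P = np.full((3, 3), 0.5)
    P[0, 1], P[1, 0] = p12, 1.0 - p12
    P[0, 2], P[2, 0] = a, 1.0 - a
    P[1, 2], P[2, 1] = b, 1.0 - b
    return P

# Corners of the assumed box uncertainty set over (a, b).
CORNERS = [preference_matrix(a, b) for a, b in itertools.product([0.0, 1.0], repeat=2)]

def pessimistic_value(pi, opponent_support):
    """Worst-case preference of pi over any opponent on the given support
    and any preference matrix in the box (bilinear, so corners suffice)."""
    return min((pi @ P)[j] for P in CORNERS for j in opponent_support)

def solve(opponent_support, n=20):
    """Grid search over the 3-simplex (step 1/n) for the max player."""
    best_pi, best_val = None, -1.0
    for i in range(n + 1):
        for j in range(n + 1 - i):
            pi = np.array([i, j, n - i - j]) / n
            val = pessimistic_value(pi, opponent_support)
            if val > best_val + 1e-12:
                best_pi, best_val = pi, val
    return best_pi, best_val

# Unrestricted opponent: may play the under-sampled output y3.
pi_unres, v_unres = solve(opponent_support=[0, 1, 2])
# Restricted opponent: limited to the well-sampled outputs {y1, y2}.
pi_res, v_res = solve(opponent_support=[0, 1])

print("unrestricted:", pi_unres, v_unres)
print("restricted:  ", pi_res, v_res)
```

In this toy setting the unrestricted solution places half of its probability on $y_3$ in order to hedge against an opponent that plays $y_3$, while the restricted solution places none there and attains a higher worst-case preference, which mirrors the qualitative contrast between the middle and right panels of Figure 2.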

Theorems & Definitions (7)

  • Definition 3.1: Covered policy set
  • Lemma 3.2: Preference guarantee for the restricted pessimistic Nash policy
  • Lemma 4.1
  • Lemma 4.2
  • proof: Proof of Lemma \ref{lem:restricted-nash}
  • proof
  • proof