Self-Improving Robust Preference Optimization

Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar

TL;DR

SRPO introduces a self-improving robust preference optimization framework that addresses inference-time self-correction and task-robustness in offline RLHF. It formulates a KL-regularized min-max objective between a self-improvement policy and a robust generative policy, which can be reduced to a non-adversarial supervised loss $L_{\alpha}$ and estimated offline without reward models. Theoretical results show how the optimal self-improvement policy and robust policy relate to human preference probabilities, enabling joint optimization through a single objective. Empirically, SRPO demonstrates strong win-rate gains over DPO and IPO, including substantial improvements in out-of-distribution settings and effective recursive refinements through self-improvement, highlighting robustness to the behavior policy and scalability across tasks.

Abstract

Online and offline RLHF methods, such as PPO and DPO, have been highly successful in aligning AI with human preferences. Despite their success, however, these methods suffer from fundamental limitations: (a) Models trained with RLHF can learn from mistakes or negative examples through an RL mechanism or a contrastive loss during training. However, at inference time, they lack an innate self-improvement mechanism for error correction. (b) The optimal solution of existing methods is highly task-dependent, making it difficult for them to generalize to new tasks. To address these challenges, we propose Self-Improving Robust Preference Optimization (SRPO), a practical and mathematically principled offline RLHF framework. The key idea behind SRPO is to cast the problem of learning from human preferences as a self-improvement process, mathematically formulated as a min-max objective that jointly optimizes a self-improvement policy and a generative policy in an adversarial fashion. Crucially, the solution of this optimization problem is independent of the training task, which makes it robust to changes in the task. We then show that this objective can be reformulated as a non-adversarial offline loss, which can be efficiently optimized using standard supervised learning techniques at scale. To demonstrate SRPO's effectiveness, we evaluate it using AI Win-Rate (WR) against human (GOLD) completions. When tested on the XSum dataset, SRPO outperforms DPO by a margin of 15% after 5 self-revisions, achieving an impressive 90% WR. Moreover, on the challenging Arena-Hard prompts, SRPO outperforms both DPO and IPO (by 4% without revision and 6% after a single revision), reaching a 56% WR against Llama-3.1-8B-Instruct.
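
The inference-time self-revision described in the abstract can be pictured as a simple loop: draw an initial completion from the generative policy, then repeatedly feed the latest completion back to the self-improvement policy. The following is a minimal sketch, not the authors' implementation. The names `self_revise`, `generate`, and `revise` are hypothetical, and the two toy functions only stand in for real model calls.

```python
from typing import Callable, List


def self_revise(
    prompt: str,
    generate: Callable[[str], str],
    revise: Callable[[str, str], str],
    n_revisions: int = 5,
) -> List[str]:
    """Return the initial draft followed by n_revisions successive revisions.

    `generate` and `revise` are hypothetical stand-ins for sampling from the
    generative policy and the self-improvement policy, respectively.
    """
    completions = [generate(prompt)]  # initial draft, conditioned on the prompt only
    for _ in range(n_revisions):
        # Each revision is conditioned on the prompt and the previous completion.
        completions.append(revise(prompt, completions[-1]))
    return completions


# Toy stand-ins so the sketch runs without a model (assumptions, not a real API).
def toy_generate(prompt: str) -> str:
    return "draft"


def toy_revise(prompt: str, previous: str) -> str:
    return previous + "+"


history = self_revise("Summarize: ...", toy_generate, toy_revise, n_revisions=5)
print(len(history))  # 6: one draft plus five revisions
print(history[-1])   # draft+++++
```

With 5 revisions the loop yields six completions in total, which matches the "after 5 self-revisions" setting quoted for XSum; how each revision is actually prompted and sampled is not specified here.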

Paper Structure

This paper contains 24 sections, 1 theorem, 21 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Given a context $x$, the behavior policy $\mu(\cdot|x)$, and scalars $\alpha\in[0,1]$ and $\beta>0$, the solution of the min-max objective (eq:main-obj) is obtained by minimizing a loss expressed in terms of $L(\pi, \pi_{\dagger})$ and $L_{\dagger}(\pi_{\dagger})$ (the defining equations are not reproduced in this listing).

Figures (5)

  • Figure 1: Self-improvement by SRPO in terms of win rates against human completions (WR). We demonstrate robustness by training on TL;DR and evaluating on XSum. Gains over Direct Preference Optimization (DPO) are reported in text captions.
  • Figure 2: Learned action probabilities for the synthetic example. SRPO always chooses the correct arm regardless of the skew in $\mu$, while both IPO and DPO are affected by the skew (see fig:mu1).
  • Figure 3: We present the win rates of SRPO, IPO, and DPO against human-written summaries (GOLD) as a function of $N$-revisions for both in-distribution (TL;DR) and out-of-distribution (XSum) settings. The curves represent the mean win rates, with shaded areas indicating the st.dev. across 20 bootstrap evaluations. Notably, DPO and IPO show no improvements in their generations, whereas SRPO shows significant improvements with each iteration.
  • Figure 4: We present the win rates of SRPO against human-written summaries (GOLD) as a function of $N$-revision iterations at different $\alpha$ values. We report their mean (curve) $\pm$ st.dev. (shaded area) across 20 bootstrap evaluations, as described in the Evaluation section. We observe that SRPO achieves a meaningful capability for iterative improvement as the value of $\alpha$ increases.
  • Figure 5: We present the win rates of SRPO, IPO, DPO, SIMIPO and RPO against LLaMA-3.1-8B-Instruct as a function of $N$-revisions in the Arena-Hard prompt setting. The curves represent the mean win rates, with shaded areas indicating the st.dev. across 20 bootstrap evaluations. Notably, SRPO achieves a higher win rate than all other methods at every revision step.
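
The captions above report mean win rates with a standard deviation across 20 bootstrap evaluations. A minimal sketch of one way such an estimate could be computed is given below. It assumes each prompt has a binary judged outcome (1 if the model's completion is preferred, 0 otherwise) and that each bootstrap evaluation resamples prompts with replacement; the paper's exact procedure is not given in this listing, and the function name `bootstrap_win_rate` is hypothetical.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_win_rate(
    wins: List[int], n_bootstrap: int = 20, seed: int = 0
) -> Tuple[float, float]:
    """Return (mean, standard deviation) of the win rate over bootstrap resamples.

    `wins` holds one 0/1 outcome per evaluated prompt (an assumed data layout).
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(wins)
    rates = []
    for _ in range(n_bootstrap):
        # Resample n outcomes with replacement and compute that sample's win rate.
        sample = [wins[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(sample) / n)
    return statistics.mean(rates), statistics.stdev(rates)


# Illustrative outcomes only (not data from the paper): 90 wins out of 100 prompts.
example_wins = [1] * 90 + [0] * 10
mean_wr, std_wr = bootstrap_win_rate(example_wins)
print(f"win rate: {mean_wr:.3f} +/- {std_wr:.3f}")
```

The mean corresponds to the plotted curve and the standard deviation to the shaded band; a larger evaluation set narrows the band for a fixed number of resamples.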

Theorems & Definitions (2)

  • Theorem 1
  • Remark 2