Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization

Ruijie Xu, Zhihan Liu, Yongfei Liu, Shipeng Yan, Zhaoran Wang, Zhi Zhang, Xuming He

TL;DR

This work proposes a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities, and employs fine-grained arithmetic control over the optimality gap between positive and negative examples.

Abstract

We address the challenge of online Reinforcement Learning from Human Feedback (RLHF) with a focus on self-rewarding alignment methods. In online RLHF, obtaining feedback requires interaction with the environment, which can be costly when using additional reward models or the GPT-4 API. Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities, which are effective for large-scale models but difficult to transfer to smaller ones. To address these limitations, we propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities. Additionally, we employ fine-grained arithmetic control over the optimality gap between positive and negative examples, generating more hard negatives in the later stages of training to help the model better capture subtle human preferences. Finally, we conduct extensive experiments on two base models, Mistral-7B and Mistral-Instruct-7B; our method significantly improves on the reference model, achieving a 34.5% length-controlled win rate on AlpacaEval 2.0.

Paper Structure

This paper contains 24 sections, 2 theorems, 12 equations, 4 figures, and 3 tables.

Key Result

Proposition 1

Given the instruction $x$ and different reward scores, the policy can generate responses of varying quality. We denote two different policies as $\pi_{\text{good}}$ and $\pi_{\text{bad}}$, which have the following forms

Figures (4)

  • Figure 1: Distribution of scores for responses generated by the self-rewarding algorithm [yuan2024self] on the Mistral-7B model. The prompts come from UltraFeedback [cui2023ultrafeedback], totaling approximately 60k. About 80% of the responses are rated with a score of 4.
  • Figure 2: Our method consists of two parts: generating the preference dataset and conducting DPO training. When generating the preference dataset, the input is the prompt $x_i$. To generate the chosen response $y^c_i$, we prepend a chosen prefix to $x_i$, and to generate the rejected response $y^r_i$, we prepend a rejected prefix to $x_i$. The final preference dataset is composed of $\{x_i, y^c_i, y^r_i\}$. This dataset will be used for the current round of DPO training. The trained model from this round will serve as the reference model for the next round.
  • Figure 3: Average reward scores of responses generated with different prefix scores for the same model. The model used here is the reference model from the first iteration.
  • Figure 4: Average reward scores of chosen responses and rejected responses across different iterations. For each iteration, rejected responses are shown on the left and chosen responses on the right. The rejected scores for iterations 1, 2, and 3 are 3, 5, and 7, respectively.
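
The data-generation step described in the Figure 2 caption, combined with the per-iteration rejected scores from the Figure 4 caption, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prefix wording in `make_prefix`, the 10-point scale, the value of `CHOSEN_SCORE`, and the `generate` callable (a stand-in for sampling from the current policy) are all hypothetical. Only the rejected scores 3, 5, and 7 for iterations 1, 2, and 3 come from the captions.

```python
from typing import Callable, Dict, List

# Rejected-prefix scores per iteration, as reported in the Figure 4 caption.
REJECTED_SCORES: Dict[int, int] = {1: 3, 2: 5, 3: 7}

# Assumption: the chosen prefix requests the top score on a 10-point scale.
CHOSEN_SCORE = 10


def make_prefix(score: int) -> str:
    """Hypothetical prefix wording; the paper's actual template is not shown here."""
    return f"Write a response that would be rated {score} out of 10.\n\n"


def build_preference_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    iteration: int,
) -> List[Dict[str, str]]:
    """Build {prompt, chosen, rejected} triples using only prompting.

    No judge or reward model is called: the quality gap comes solely
    from the score requested in the prefix.
    """
    rejected_score = REJECTED_SCORES[iteration]
    dataset = []
    for x in prompts:
        y_chosen = generate(make_prefix(CHOSEN_SCORE) + x)
        y_rejected = generate(make_prefix(rejected_score) + x)
        # The stored prompt is the bare instruction, without any prefix.
        dataset.append({"prompt": x, "chosen": y_chosen, "rejected": y_rejected})
    return dataset


def echo_generate(text: str) -> str:
    """Toy stand-in for the policy model; it only echoes its input."""
    return f"<response to: {text!r}>"


if __name__ == "__main__":
    pairs = build_preference_dataset(["Explain DPO briefly."], echo_generate, iteration=2)
    print(pairs[0]["rejected"])
```

Raising the rejected score across iterations narrows the requested gap between chosen and rejected responses, which is one way to read the abstract's "more hard negatives in the later stages of training". The resulting triples would then feed one round of DPO training, with the trained model serving as the next round's reference model.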

Theorems & Definitions (4)

  • Proposition 1: Quality gap between responses
  • proof
  • Lemma 1: Oracle optimal KL-regularized policy
  • proof