WPO: Enhancing RLHF with Weighted Preference Optimization

Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu

TL;DR

This paper tackles the distribution gap problem in off-policy RLHF by introducing Weighted Preference Optimization (WPO), which reweights preference pairs by their probability under the current policy to simulate on-policy learning without additional data collection costs. A bootstrapping-inspired data regeneration view and a token-level weight alignment mechanism (including greedy and sampled variants, with sampled alignment as the default) enable stable, on-policy-like optimization using off-policy data. Empirical results on Alpaca Eval 2 and MT-bench show WPO consistently beating Direct Preference Optimization (DPO) and achieving strong performance in hybrid RL settings, including a 76.7% length-controlled win rate against GPT-4-turbo with Gemma-2-9b-it. The work provides a practical, cost-efficient enhancement to RLHF that can augment existing loss functions and datasets, though it acknowledges a remaining gap to fully on-policy performance and calls for broader preference data to cover safety and multi-turn scenarios.

Abstract

Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 76.7% based on Gemma-2-9b-it. We release the code and models at https://github.com/wzhouad/WPO.
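
The reweighting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' released implementation (see the repository linked above for that). It assumes the pair weight is the product of the length-normalized probabilities of the chosen and rejected responses under the current policy, and it omits the greedy and sampled weight alignment variants mentioned in the TL;DR. The function names `dpo_loss`, `length_normalized_prob`, and `weighted_pair_loss` are hypothetical, as are the example log-probabilities.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def length_normalized_prob(token_logps):
    """Geometric-mean token probability of a response under the policy.

    Assumption for this sketch: length normalization keeps long responses
    from receiving vanishingly small weights.
    """
    return math.exp(sum(token_logps) / len(token_logps))


def weighted_pair_loss(chosen_token_logps, rejected_token_logps,
                       ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss scaled by how likely the pair is under the current policy.

    In a gradient-based framework the weight would typically be treated as
    a constant (no gradient flows through it); that is an assumption here.
    """
    weight = (length_normalized_prob(chosen_token_logps)
              * length_normalized_prob(rejected_token_logps))
    loss = dpo_loss(sum(chosen_token_logps), sum(rejected_token_logps),
                    ref_logp_w, ref_logp_l, beta)
    return weight * loss


# Hypothetical per-token log-probabilities under the current policy.
on_policy_like = weighted_pair_loss([-0.1, -0.2], [-0.3, -0.4], -0.3, -0.7)
off_policy_like = weighted_pair_loss([-3.0, -4.0], [-3.5, -4.5], -7.0, -8.0)
print(on_policy_like, off_policy_like)
```

Under these assumptions, a pair whose responses the current policy could plausibly have generated contributes more to the total loss than a pair the policy finds very unlikely, which is the sense in which off-policy data is made to resemble on-policy data.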

Paper Structure (16 sections, 12 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: Overview of the Weighted Preference Optimization (WPO). Some notations are labeled along with corresponding components. Existing DPO directly optimizes the policy to best satisfy the preferences with off-policy data. In contrast, WPO adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy.
  • Figure 2: Weight distribution of outputs sampled using the policy model with different alignment methods.
  • Figure 3: Results of WPO in different RL settings. The hybrid setting consistently yields better results than other RL settings.
  • Figure 4: Results of variations of WPO in different RL settings.
  • Figure 5: Results of DPO and WPO when trained for more epochs.