Preference as Reward, Maximum Preference Optimization with Importance Sampling

Zaifan Jiang; Xing Huang; Chao Wei

TL;DR

This paper targets the challenge of aligning large language models with human values while reducing the complexity and instability of RLHF. It introduces Maximum Preference Optimization (MPO), an off-policy method that directly maximizes preference rewards via an importance-sampling perspective and uses offline data to implement forward KL regularization, eliminating the need for a reward model and a reference policy. MPO achieves a synthesis of the RLHF and IPO philosophies, yielding a simpler, memory-efficient training loop with competitive or superior performance on preference benchmarks and improved resistance to overfitting on tasks unrelated to the preference data. The work suggests that off-policy KL regularization via offline data is a practical, scalable path for robust preference alignment in LLMs, with future directions including data-weight balancing and regulation to prevent overfitting to reference data.

Abstract

Preference learning is a key technology for aligning language models with human values. Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm for preference learning: it first fits a reward model to preference scores and then optimizes the generating policy with the on-policy PPO algorithm to maximize the reward. The RLHF pipeline is complex, time-consuming, and unstable. Direct Preference Optimization (DPO) instead uses an off-policy algorithm to optimize the generating policy directly, which eliminates the need for a reward model. DPO is more data-efficient and stable than RLHF. However, DPO tends to overfit the preference data and to ignore the KL-regularization term when the preference is deterministic. Identity mapping Preference Optimization (IPO) uses a root-finding MSE loss to incorporate KL-regularization. However, both DPO and IPO fail to properly address the KL-regularization term because the support of the preference distribution differs from that of the reference distribution. In this paper, we propose a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, which we call Maximum Preference Optimization (MPO). MPO incorporates an off-policy KL-regularization term, making regularization truly effective. MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while remaining an off-policy algorithm. Furthermore, MPO eliminates the need for a reward model and a reference policy, simplifying the learning process and reducing memory usage.
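The contrast the abstract draws between DPO and IPO can be made concrete with a minimal sketch of the two per-pair losses as they are commonly stated in the literature. This is an illustration, not code from the paper: the function names, the `margin` argument (the log-ratio difference between the preferred and rejected responses relative to the reference policy), and the values of `beta` and `tau` are assumptions chosen for the example.

```python
import math

def dpo_loss(margin, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    `margin` is assumed to be
    log(pi(y_w|x)/pi_ref(y_w|x)) - log(pi(y_l|x)/pi_ref(y_l|x)).
    """
    z = beta * margin
    # Numerically stable softplus(-z) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def ipo_loss(margin, tau=0.1):
    """Per-pair IPO loss: squared distance of the margin from 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2

# Hypothetical margins, growing as the policy drifts away from the reference.
margins = [0.0, 5.0, 50.0, 500.0]
dpo_values = [dpo_loss(m) for m in margins]
ipo_values = [ipo_loss(m) for m in margins]

for m, d, i in zip(margins, dpo_values, ipo_values):
    print(f"margin={m:7.1f}  dpo={d:.6f}  ipo={i:.2f}")
```

Under these assumptions the DPO loss keeps shrinking as the margin grows without bound, which is one way to see the overfitting behaviour described for deterministic preferences, while the IPO loss is minimized at the finite margin 1 / (2 * tau) and penalizes drifting beyond it.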

Paper Structure (27 sections, 2 theorems, 37 equations, 1 figure, 3 tables, 1 algorithm)

Introduction
Preliminaries
Pretraining and SFT phase
Preference data collection phase
Reinforcement-learning optimization phase
Local distribution introduced by a preference pair $\langle x, y_w, y_l\rangle$
Background
Reinforcement Learning from Human Feedback (RLHF)
Reward estimation from preference data
Reward maximization using PPO algorithm
Direct Preference Optimization (DPO)
$\Psi$-PO with identity mapping (IPO)
Method
Preference (reward) Maximization with Importance Sampling
Off-policy Preference Learning under KL-regularization
...and 12 more sections

Key Result

Theorem 4.1

The gradient of the preference (reward) maximization objective can be estimated from $\mathcal{D}^p$.
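The theorem statement itself is not reproduced on this page, so the following is only a generic sketch of the importance-sampling identity that off-policy estimation of this kind relies on, not the paper's estimator. It checks that an expectation under the policy equals an importance-weighted expectation under a different sampling distribution. The two-response setup and every number in it (`policy`, `behavior`, `reward`) are hypothetical.

```python
def expected_reward(policy, reward):
    """Exact E_{y ~ policy}[reward(y)] over a finite set of responses."""
    return sum(policy[y] * reward[y] for y in policy)

def importance_weighted_reward(policy, behavior, reward):
    """Exact E_{y ~ behavior}[(policy(y) / behavior(y)) * reward(y)]."""
    return sum(
        behavior[y] * (policy[y] / behavior[y]) * reward[y]
        for y in behavior
    )

# Hypothetical local distribution over one preference pair (y_w, y_l).
policy = {"y_w": 0.7, "y_l": 0.3}    # current policy (assumed values)
behavior = {"y_w": 0.5, "y_l": 0.5}  # distribution the offline data came from (assumed)
reward = {"y_w": 1.0, "y_l": 0.0}    # preference used as reward (assumed values)

on_policy = expected_reward(policy, reward)
off_policy = importance_weighted_reward(policy, behavior, reward)
print(on_policy, off_policy)
```

The two quantities agree up to floating-point rounding because the weight policy(y) / behavior(y) corrects for sampling from the wrong distribution. This requires the sampling distribution to put nonzero mass wherever the policy does.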

Figures (1)

Figure 1: Maximum Preference Optimization (MPO) directly optimizes preference maximization on preference data with an off-policy algorithm and uses offline SFT and pretraining data to make KL-regularization truly effective, which also eliminates the need for both a reward model and a reference policy.

Theorems & Definitions (3)

Theorem 4.1
Theorem A.1
proof
