PILAF: Optimal Human Preference Sampling for Reward Modeling

Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng, Julia Kempe, Yaqi Duan

TL;DR

The paper proposes Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward.

Abstract

As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, demonstrating optimality from both an optimization and a statistical perspective. The method is straightforward to implement and demonstrates strong performance in iterative and online RLHF settings where feedback curation is critical.

Paper Structure

This paper contains 66 sections, 14 theorems, 150 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 4.1

Using data collected from our proposed response sampling scheme T-PILAF, the gradient of $\mathcal{L}(\theta)$ satisfies [displayed equation not preserved in this extract], where the constant $\overline{Z}_{\theta}$ is defined in equation eq:weight, and the term $T_2$ represents a second-order error.
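The loss $\mathcal{L}(\theta)$ is not written out in this extract. Assuming it is the standard DPO objective (an assumption based on the theorem's title, "Gradient structure in DPO training"), a minimal per-pair sketch looks as follows. The function name, argument names, and the default `beta` are illustrative and do not come from the paper.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference policy. beta scales the implicit reward
    beta * log(pi_theta / pi_ref); 0.1 is an arbitrary illustrative value.
    """
    # Difference of implicit rewards between chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# With identical policy and reference, the margin is zero and loss = log 2.
loss_at_init = dpo_pair_loss(-5.0, -6.0, -5.0, -6.0)
# Raising the chosen response's log-probability lowers the loss.
loss_improved = dpo_pair_loss(-4.0, -6.0, -5.0, -6.0)
```

The theorem concerns the expected gradient of such a loss under the T-PILAF sampling distribution; this sketch only shows the per-pair loss, not the sampling scheme or the gradient identity.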

Figures (6)

  • Figure 1: Overview of our approach. (a) We consider a full RLHF training setup, where a language model (LM) policy is iteratively refined through active data collection. Our goal is to develop an optimal response sampling method for preference labeling. (b) We introduce PILAF, which generates responses by interpolating between the current policy and a reference policy, balancing exploration and exploitation. (c) Our theoretical analysis shows that T-PILAF aligns the parameter gradient with the steepest direction for maximizing human values and achieves more favorable convergence in regions of high sensitivity.
  • Figure 2: Reward-KL curve for Iterative DPO. All training runs start from the same model obtained at the end of the first iteration via Vanilla Sampling. Each dot represents an evaluation performed every 50 training steps.
  • Figure 3: Reward-KL curve for Online DPO. Each dot represents an evaluation performed every 50 training steps.
  • Figure 4: Online DPO with an overfitted initial policy. Each dot represents an evaluation performed every 50 training steps. Color saturation indicates the training step, with darker colors representing later steps.
  • Figure 5: Online DPO with an overfitted initial policy. Full results of Figure 4. Each dot represents an evaluation performed every 50 training steps. Color saturation indicates the training step, with darker colors representing later steps.
  • ...and 1 more figure
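The caption of Figure 1(b) describes generating responses by interpolating between the current policy and a reference policy. One way to picture this, as a sketch under assumptions rather than the paper's exact procedure, is to mix the two policies' next-token logits with a coefficient before sampling. The names `interpolate_policies`, `sample_token`, and `alpha`, and the toy logits, are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities with the usual max-shift for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_policies(policy_logits, ref_logits, alpha):
    """Next-token distribution from mixed logits (assumed form).

    alpha = 0 recovers the current policy, alpha = 1 the reference policy;
    values in between interpolate geometrically between the two.
    """
    mixed = [(1.0 - alpha) * p + alpha * r
             for p, r in zip(policy_logits, ref_logits)]
    return softmax(mixed)


def sample_token(probs, rng):
    """Draw one token index from a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy logits over a 3-token vocabulary (made-up values).
policy_logits = [2.0, 0.5, -1.0]
ref_logits = [0.0, 1.0, 0.5]

at_policy = interpolate_policies(policy_logits, ref_logits, 0.0)
at_ref = interpolate_policies(policy_logits, ref_logits, 1.0)
halfway = interpolate_policies(policy_logits, ref_logits, 0.5)
token = sample_token(halfway, random.Random(0))
```

In a real language model this mixing would be applied at every decoding step to full-vocabulary logits; how the coefficient is chosen and which pair of mixed policies produces the two responses to be compared are specified in the paper, not here.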

Theorems & Definitions (14)

  • Theorem 4.1: Gradient structure in DPO training
  • Theorem 4.2
  • Theorem 4.3
  • Lemma 4.4
  • Theorem B.1: Gradient structure in DPO training
  • Lemma B.2: Gradient of value $J(\pi_{\theta})$
  • Lemma B.3: Gradient of the loss function $\mathcal{L}(\theta)$
  • Theorem B.4
  • Lemma B.5
  • Lemma B.6
  • ...and 4 more