
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

Qining Zhang, Lei Ying

TL;DR

This work tackles RLHF without reward inference by introducing two zeroth-order policy-gradient methods, ZPG and ZBCPG, that optimize directly from human trajectory preferences in general RL settings with stochastic transitions. The methods estimate local value differences from human feedback and apply zeroth-order gradient estimators (full-vector in ZPG, block-coordinate in ZBCPG). The authors prove polynomial convergence rates and report that both methods outperform DPO and PPO baselines in stochastic GridWorld environments. The methods accommodate several preference models (e.g., Bradley-Terry and Weibull) and support parallelized learning, offering a practical alternative to traditional pipelines that requires no reward model. The results show that RLHF without reward inference can be provably efficient, and they highlight trade-offs among human-query cost, parallelization, and exploration strategy, pointing to future enhancements such as KL regularization and AI-assisted feedback.

Abstract

Reward inference (learning a reward model from human preferences) is a critical intermediate step in the Reinforcement Learning from Human Feedback (RLHF) pipeline for fine-tuning Large Language Models (LLMs). In practice, RLHF faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLM applications. However, DPO utilizes the closed-form expression between the optimal policy and the reward function, which is only suitable under the bandit setting or deterministic MDPs. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and general preference models beyond the Bradley-Terry model. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. For both algorithms, we establish polynomial convergence rates in terms of the number of policy gradient iterations, the number of trajectory samples, and human preference queries per iteration. Numerical experiments in stochastic environments validate the performance of our proposed algorithms, outperforming popular RLHF baselines such as DPO and PPO. Our paper shows there exist provably efficient methods to solve general RLHF problems without reward inference.
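
The key idea in the abstract (estimate a local value difference from preferences, then form a zeroth-order gradient estimate) can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the authors' ZPG implementation: `toy_value` is a hypothetical stand-in for a policy's value function, the panel of evaluators is simulated with a Bradley-Terry (logistic) link, and preferences compare values directly rather than sampled trajectory pairs. All function names and hyperparameters are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def logit(p):
    # Inverse of the logistic (Bradley-Terry) link.
    return np.log(p / (1.0 - p))


def toy_value(theta):
    # Hypothetical stand-in for the value of policy pi_theta; maximized at theta = 1.
    return -np.sum((theta - 1.0) ** 2)


def query_panel(value_diff, panel_size, rng):
    # Simulated human panel (assumption): each evaluator prefers the perturbed
    # policy with probability sigmoid(value_diff). Returns the fraction of votes.
    votes = rng.random(panel_size) < sigmoid(value_diff)
    return votes.mean()


def estimate_value_diff(theta, direction, mu, panel_size, rng):
    # The learner never sees true_diff; it is used only to simulate the votes.
    true_diff = toy_value(theta + mu * direction) - toy_value(theta)
    p_hat = query_panel(true_diff, panel_size, rng)
    # Clip away from 0 and 1 so the inverse link stays finite.
    eps = 1.0 / (2.0 * panel_size)
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return logit(p_hat)


def zeroth_order_step(theta, mu, alpha, panel_size, rng):
    d = theta.size
    # Random perturbation direction, uniform on the unit sphere.
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    diff_hat = estimate_value_diff(theta, v, mu, panel_size, rng)
    # Zeroth-order gradient estimate from the preference-based value difference.
    grad_hat = (d / mu) * diff_hat * v
    # Gradient ascent on the (unknown) value function.
    return theta + alpha * grad_hat


def run_zeroth_order_ascent(num_iters=300, dim=4, mu=0.5, alpha=0.05,
                            panel_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(num_iters):
        theta = zeroth_order_step(theta, mu, alpha, panel_size, rng)
    return theta
```

Calling `run_zeroth_order_ascent()` moves the parameters toward the maximizer of the toy objective. A larger `panel_size` reduces the noise in the inverted preference estimate, at the cost of more human queries per iteration.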

Paper Structure

This paper contains 26 sections, 11 theorems, 114 equations, 3 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Choose the perturbation distance $\mu$ and the learning rate $\alpha$ as follows: If $M = \Omega(H^2)$ and ${\bm{\theta}}_R$ is picked uniformly at random from the trajectory $\{{\bm{\theta}}_0, {\bm{\theta}}_1, \cdots, {\bm{\theta}}_{T-1}\}$, then the convergence rate of ZPG satisfies:

Figures (3)

  • Figure 1: A diagram illustrating classic policy-based RLHF and DPO: classic RLHF involves three steps: (i) policy pre-training: pre-train a policy network (agent), (ii) reward inference: collect trajectories from the environment using a behavior policy, query the human comparison for each trajectory pair and train a reward neural network through maximizing the likelihood under the Bradley-Terry model, and (iii) policy training with reward model: train the policy network with reward signals sampled from the reward network. DPO does not train a reward network but directly optimizes the policy network from human preferences.
  • Figure 2: GridWorld with Bradley-Terry Feedback: (a) the trajectory return of ZPG, ZBCPG, and RLHF baselines, and (b) the return of ZBCPG with different parallelization levels. All results are averaged over $10^5$ repetitions of policy evaluation and shaded areas indicate confidence intervals.
  • Figure 3: GridWorld with Weibull Feedback: (a) the return of ZPG, ZBCPG, and RLHF baselines under Weibull human feedback with panel size $M=1000$, and (b) the trajectory return of ZPG, ZBCPG, and RLHF baselines under Weibull human feedback with panel size $M=200$. All results are averaged over $10^5$ repetitions of policy evaluation and shaded areas indicate confidence intervals.

Theorems & Definitions (11)

  • Theorem 1
  • Theorem 2
  • Corollary 1
  • Lemma 1: Concentration of Reward Difference
  • Lemma 2: Concentration of Preference Probability
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • ...and 1 more