Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference
Qining Zhang, Lei Ying
TL;DR
This work tackles RLHF without reward inference by introducing two zeroth-order policy-gradient methods, ZPG and ZBCPG, that optimize directly from human trajectory preferences in general RL settings with stochastic transitions. Both methods estimate local value differences from human feedback and pass them to a zeroth-order gradient estimator: ZPG uses a full-vector estimator and ZBCPG a block-coordinate one. The authors prove polynomial convergence rates and report that the methods outperform DPO and PPO in stochastic GridWorld environments. The methods accommodate various preference models (e.g., Bradley-Terry and Weibull) and support parallelizable learning, offering a practical alternative to pipelines that first learn a reward model. The results show that RLHF without reward inference can be provably efficient, and they highlight trade-offs related to human-query costs, parallelization, and exploration strategies, pointing to future enhancements such as KL regularization and AI-assisted feedback.
Abstract
Reward inference (learning a reward model from human preferences) is a critical intermediate step in the Reinforcement Learning from Human Feedback (RLHF) pipeline for fine-tuning Large Language Models (LLMs). In practice, RLHF faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLM applications. However, DPO utilizes the closed-form expression between the optimal policy and the reward function, which is only suitable under the bandit setting or deterministic MDPs. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and general preference models beyond the Bradley-Terry model. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. For both algorithms, we establish polynomial convergence rates in terms of the number of policy gradient iterations, the number of trajectory samples, and human preference queries per iteration. Numerical experiments in stochastic environments validate the performance of our proposed algorithms, which outperform popular RLHF baselines such as DPO and PPO. Our paper shows there exist provably efficient methods to solve general RLHF problems without reward inference.
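The key idea stated in the abstract (estimate a local value difference from preferences, then form a zeroth-order gradient estimate) can be illustrated with a small sketch. This is not the authors' ZPG or ZBCPG implementation. It assumes a hypothetical toy objective `toy_value` in place of trajectory returns in an MDP, simulated Bradley-Terry feedback in place of human raters, and illustrative values for the step size `lr`, perturbation radius `mu`, and number of queries `n_queries`. All function names are made up for this example.

```python
import numpy as np

def toy_value(theta):
    # Hypothetical stand-in for the policy value V(pi_theta); maximized at theta = 1.
    return -float(np.sum((theta - 1.0) ** 2))

def query_preferences(theta_a, theta_b, n_queries, rng):
    # Simulated Bradley-Terry feedback (assumption): each query prefers a over b
    # with probability sigmoid(V(a) - V(b)). Returns the empirical preference rate.
    p = 1.0 / (1.0 + np.exp(-(toy_value(theta_a) - toy_value(theta_b))))
    return rng.binomial(n_queries, p) / n_queries

def estimate_value_difference(pref_rate, n_queries):
    # Clip the rate away from 0 and 1 so the logit is finite, then invert
    # the logistic link to recover an estimate of V(a) - V(b).
    eps = 1.0 / (2.0 * n_queries)
    p = np.clip(pref_rate, eps, 1.0 - eps)
    return float(np.log(p / (1.0 - p)))

def zeroth_order_gradient(theta, mu, n_queries, rng):
    # Perturb theta along a random unit direction v and compare the perturbed
    # and unperturbed parameters using preference feedback only.
    d = theta.size
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    rate = query_preferences(theta + mu * v, theta, n_queries, rng)
    diff = estimate_value_difference(rate, n_queries)
    # Standard sphere-smoothing estimator: (d / mu) * value difference * direction.
    return (d / mu) * diff * v

def run(n_iters=300, lr=0.02, mu=0.3, n_queries=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)
    for _ in range(n_iters):
        # Gradient ascent on the (never observed) value, driven by preferences.
        theta = theta + lr * zeroth_order_gradient(theta, mu, n_queries, rng)
    return theta

if __name__ == "__main__":
    print(run())
```

In this sketch the learner never sees `toy_value` directly; it only sees how often the perturbed parameters are preferred. A block-coordinate variant in the spirit of ZBCPG would, under the same assumptions, restrict `v` to a randomly chosen subset of coordinates at each iteration.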
