Best Policy Learning from Trajectory Preference Feedback

Akhil Agnihotri; Rahul Jain; Deepak Ramachandran; Zheng Wen

TL;DR

This paper tackles best policy identification in trajectory-based reinforcement learning using both offline and online trajectory preferences. It introduces Posterior Sampling for Preference Learning (PSPL), a Bayesian algorithm that maintains posteriors over both the reward vector $\theta$ and the transition dynamics $\eta$, using an offline preference dataset to form informed priors and then performing pure exploration online. The authors prove Bayesian simple regret guarantees for PSPL and propose a practical approximation, Bootstrapped-PSPL, which perturbs the MAP objective to make posterior sampling scalable. Empirically, PSPL outperforms baselines on classic control benchmarks and a text-to-image generation task, and its results improve as the offline dataset size and rater competence grow.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset--potentially biased or out-of-distribution and collected from a rater of subpar 'competence'--with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.
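To make the phrase "noisy binary comparisons over trajectories" concrete, the following is a minimal sketch of one way such a rater could be simulated. It assumes a logistic (Bradley-Terry style) preference model, a linear trajectory reward, and a scalar competence parameter; none of these are stated in this section, and the names `trajectory_return`, `preference_prob`, `sample_preference`, and `beta` are hypothetical, chosen for illustration only.

```python
import math
import random


def trajectory_return(features, theta):
    """Assumed linear reward: dot product of trajectory features and reward vector theta."""
    return sum(f * t for f, t in zip(features, theta))


def preference_prob(feat_a, feat_b, theta, beta):
    """Probability that the rater prefers trajectory A over trajectory B.

    Assumption: a logistic model on the return gap, scaled by a hypothetical
    competence parameter beta (beta = 0 is a coin flip; larger beta is less noisy).
    """
    gap = trajectory_return(feat_a, theta) - trajectory_return(feat_b, theta)
    return 1.0 / (1.0 + math.exp(-beta * gap))


def sample_preference(feat_a, feat_b, theta, beta, rng):
    """Draw one noisy binary label: 1 if A is preferred, 0 if B is preferred."""
    return 1 if rng.random() < preference_prob(feat_a, feat_b, theta, beta) else 0


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [1.0, -0.5]                    # illustrative reward vector
    traj_a, traj_b = [2.0, 1.0], [1.0, 1.0]  # illustrative trajectory features
    for beta in (0.0, 1.0, 5.0):
        labels = [sample_preference(traj_a, traj_b, theta, beta, rng) for _ in range(1000)]
        print(beta, sum(labels) / len(labels))
```

Under this assumed model, a rater with low `beta` labels pairs almost at random, so an offline dataset collected from that rater carries little information about $\theta$; this is one way to picture why the quality of the offline data matters for the online phase.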
