Best Policy Learning from Trajectory Preference Feedback

Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

TL;DR

This paper tackles Best Policy Identification in trajectory-based reinforcement learning using offline and online trajectory preferences. It introduces Posterior Sampling for Preference Learning (PSPL), a Bayesian algorithm that maintains posteriors over both the reward vector $\theta$ and the transition dynamics $\eta$, leveraging an offline preference dataset to form informed priors and performing pure exploration online. The authors prove Bayesian simple regret guarantees for PSPL and propose a practical Bootstrapped-PSPL approximation based on perturbations of the MAP objective to enable scalable posterior sampling. Empirically, PSPL outperforms baselines on classic control benchmarks and a text-to-image generation task, with results improving as offline data size and rater competence grow, illustrating the method’s robustness and practical potential for RLHF-aligned systems.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset--potentially biased or out-of-distribution and collected from a rater of subpar 'competence'--with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.
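To make "noisy binary comparisons over trajectories" and rater "competence" concrete, here is a minimal sketch of one common way to model such feedback. It is an illustration under stated assumptions, not the paper's exact model: it assumes a logistic (Bradley-Terry-style) preference likelihood, a trajectory return that is linear in a hypothetical feature vector, and a scalar `beta` standing in for rater competence. The function names are invented for this example.

```python
import math
import random


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def trajectory_return(features, theta):
    """Assumed linear return: dot product of trajectory features and reward vector."""
    return sum(f * t for f, t in zip(features, theta))


def preference_probability(feat_a, feat_b, theta, beta):
    """Probability that the rater prefers trajectory A over trajectory B.

    Assumption: beta acts as rater competence. beta = 0 yields coin flips;
    a large beta approaches a noiseless comparison of the two returns.
    """
    gap = trajectory_return(feat_a, theta) - trajectory_return(feat_b, theta)
    return _sigmoid(beta * gap)


def sample_preference(feat_a, feat_b, theta, beta, rng):
    """Draw one noisy binary label: 1 if A is preferred, 0 otherwise."""
    return 1 if rng.random() < preference_probability(feat_a, feat_b, theta, beta) else 0


if __name__ == "__main__":
    theta = [1.0, 0.0]                    # hypothetical reward vector
    traj_a, traj_b = [2.0, 5.0], [1.0, -3.0]
    rng = random.Random(0)
    for beta in (0.0, 1.0, 10.0):
        p = preference_probability(traj_a, traj_b, theta, beta)
        label = sample_preference(traj_a, traj_b, theta, beta, rng)
        print(f"beta={beta}: P(A preferred)={p:.4f}, sampled label={label}")
```

Under this assumed model, a more competent rater (larger `beta`) gives labels that track the true return gap more closely, which is consistent with the reported trend that results improve as rater competence grows.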

Paper Structure

This paper contains 19 sections, 20 theorems, 71 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Lemma 4.0

For any confidence $\delta_{1} \in (0,\frac{1}{3})$, let $\delta_{2} \in (c,1)$ with $c \in (0,1)$, be the probability that any optimal policy estimate $\widehat{\pi}^{\star}$ constructed from the offline preference dataset ${\mathcal{D}}_{0}$ is $\varepsilon$-optimal with probability at least $(1-\

Figures (7)

  • Figure 1: Comparison of $\mathsf{PSPL}$ with current state-of-the-art offline finetuning algorithms, DPO and IPO, in two benchmark environments. Online finetuning is necessary for BPI. See the appendix for more details.
  • Figure 2: $\mathsf{PSPL}$ with varying $N$, $\beta$, and $\lambda$ in benchmark environments. Shaded region around mean line represents 1 standard deviation over 5 independent runs.
  • Figure 3: Simple and Cumulative Regret ($\div 10^{3}$) vs $K$ plots. $\mathsf{PSPL}$ is run with $\lambda=50,\beta=10,N=10^{3}$.
  • Figure 4: Sample image generations along with final image reward $\widehat{r}_{\theta}(\cdot)$ over 5 independent runs.
  • Figure 5: Sensitivity to flawed expert policy with $\lambda = \{10, 10^{3}\}$, and misspecified competence.
  • ...and 2 more figures
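
The figures report both simple and cumulative regret. The sketch below shows how the two quantities differ, using the standard non-Bayesian forms as an assumption: the paper's Bayesian simple regret additionally takes an expectation over the prior, which is omitted here. The function names and the numbers are hypothetical.

```python
def cumulative_regret(optimal_value, episode_values):
    """Sum of per-episode gaps between the optimal value and the value of
    the policy actually played in each online episode."""
    return sum(optimal_value - v for v in episode_values)


def simple_regret(optimal_value, final_policy_value):
    """Gap between the optimal value and the value of the single policy
    recommended after exploration ends."""
    return optimal_value - final_policy_value


if __name__ == "__main__":
    v_star = 10.0                      # hypothetical optimal policy value
    played = [4.0, 6.0, 9.0]           # hypothetical values of exploratory policies
    recommended = 9.5                  # hypothetical value of the final recommendation
    print("cumulative regret:", cumulative_regret(v_star, played))
    print("simple regret:", simple_regret(v_star, recommended))
```

A pure-exploration method is judged by the second quantity: it may play poor exploratory policies along the way, which raises cumulative regret, as long as the final recommendation is close to optimal.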

Theorems & Definitions (36)

  • Remark 2.1
  • Remark 3.1
  • Lemma 4.0
  • Definition 4.1: State Visitation Probability
  • Lemma 4.2
  • Theorem 4.3
  • Remark 4.4
  • Lemma 5.0
  • Lemma A.1: Monotone Contraction
  • proof
  • ...and 26 more