Learning Optimal Advantage from Preferences and Mistaking it for Reward

W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson, Serena Booth, Anca Dragan, Peter Stone, Scott Niekum

TL;DR

This paper analyzes the consequences of assuming human preferences arise from partial return when they are actually generated by regret. It shows that reward-learning pipelines effectively learn the optimal advantage function $A^*_r$ rather than a true reward function $r$, and explores the implications for policy invariance, reward shaping, and data efficiency. The authors demonstrate that treating $A^*_r$ as a reward preserves optimal policies under certain conditions but introduces strong shaping and potential inefficiencies; they advocate greedy maximization of $A^*_r$ as a simpler, more principled approach. The work reframes RLHF and fine-tuning of large language models under the regret preference model, highlighting both theoretical identifiability benefits and practical considerations for sample efficiency and termination bias. Overall, the paper clarifies why partial-return-based methods perform well in practice and argues for aligning reward learning with regret-based preferences for better fidelity and efficiency in sequential tasks.

Abstract

We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return. Recent work casts doubt on the validity of this assumption, proposing an alternative preference model based upon regret. We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret. We argue that the learned function is an approximation of the optimal advantage function, $\hat{A^*_r}$, not a reward function. We find that if a specific pitfall is addressed, this incorrect assumption is not particularly harmful, resulting in a highly shaped reward function. Nonetheless, this incorrect usage of $\hat{A^*_r}$ is less desirable than the appropriate and simpler approach of greedy maximization of $\hat{A^*_r}$. From the perspective of the regret preference model, we also provide a clearer interpretation of fine tuning contemporary large language models with RLHF. This paper overall provides insight regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.

Paper Structure

This paper contains 42 sections, 3 theorems, 11 equations, 11 figures, and 1 table.

Key Result

Theorem 3.1

$\Pi^*_{\tilde{r}} = \{\pi : \forall s, \forall a ~ [\pi(a|s) > 0 \Leftrightarrow a \in \operatorname{argmax}_{a'} \tilde{r}(s,a')] \}$ if $\max_a \tilde{r}(\cdot,a)=0$.
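
One way to see why the condition $\max_a \tilde{r}(\cdot,a)=0$ makes greedy action selection optimal is sketched below. This is an informal argument, not necessarily the paper's proof. It assumes a discount factor $\gamma \in [0,1)$ (or any setting in which returns are well defined), and the symbols $V^*_{\tilde{r}}$, $Q^*_{\tilde{r}}$, and $\gamma$ are notation introduced here for illustration.

```latex
% Every reward is at most 0, so no policy can achieve a positive return:
V^*_{\tilde{r}}(s) \le 0 \quad \forall s.
% A policy that always picks some a \in \operatorname{argmax}_{a'} \tilde{r}(s,a')
% collects reward 0 at every step, so the bound is attained:
V^*_{\tilde{r}}(s) = 0 \quad \forall s.
% Substituting into the Bellman optimality equation removes the future term:
Q^*_{\tilde{r}}(s,a)
  = \tilde{r}(s,a) + \gamma \, \mathbb{E}_{s'}\!\left[ V^*_{\tilde{r}}(s') \right]
  = \tilde{r}(s,a).
% Hence an action is optimal in s exactly when it maximizes \tilde{r}(s,\cdot).
```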

Figures (11)

  • Figure 1: Three algorithms that are justified by their assumed preference model. The top algorithm was popularized by Christiano et al. (2017), and the middle algorithm was proposed by Knox et al. (2022). The third algorithm is described in Section \ref{sec:a-star_as_reward}. The reward function $\hat{r}$, optimal advantage function $\hat{A^*_r}$, and optimal policy $\hat{\pi}^*_r$ are approximations of the true versions of these functions. The function $g$ is defined generally in Equation \ref{eq:logisticonsummation} so that it can represent functions including $A^*_r$ or $r$. This paper focuses on what occurs when the solid box represents the actual algorithm for learning $g$ but the partial return preference model is assumed, causing $\hat{A^*_r}$ to be used as if it is the reward in the dashed box.
  • Figure 2: Two segments in an undiscounted task with $-1$ reward each time step. The partial return of both segments with respect to the true reward function is $-2$. The regret of the left segment is $4$. The right segment is optimal and therefore has a regret of $0$. The regret preference model is more likely to prefer the right segment (as we suspect our human readers are), whereas the partial return preference model is equally likely to prefer each segment.
  • Figure 3: Performance when noiselessly generated preference datasets do and do not include segments with transitions from absorbing states. Results are across 30 randomly generated gridworld MDPs with tabular representations of $\hat{A^*_r}$, where segments of length 3 are chosen by uniformly randomly choosing a start state and 3 actions. When transitions from absorbing states are not included, any segment that terminates before its final transition is rejected and then resampled. For $greedy~\hat{A^*_r}$ (in red), Wilcoxon paired signed-rank tests reveal that including transitions from absorbing states results in significantly higher performance for all training set sizes but the smallest, 300, with $p < 0.0007$. No significant difference in performance is detected for $greedy~Q^*_{r_{\hat{A^*_r}}}$ with or without terminating transitions except at 30,000 preferences, with a more modest $p = 0.04$. Appendix \ref{app:performance_diff_inv} contains the plot for stochastically generated preferences (Figure \ref{fig:absorbing_state_stochastic}), which shows similar results.
  • Figure 4: Comparing the effect on $greedy~Q^*_{r_{\hat{A^*_r}}}$ of including transitions from absorbing states. For each state within 30 MDPs, the plots above show the $\max_a \hat{A^*_r}(s,a)$ values. The plot shows that including such transitions moves the resultant maximum values closer to 0. The plot for stochastically generated preferences is similar and can be found in Appendix \ref{app:absorbing}. After learning with absorbing transitions, $\max_a \hat{A^*_r}(s,a)$ across all states is stochastically closer to 0 than when learning without them. Wilcoxon paired signed-rank tests at every training set size are all extremely significant, with $p < 10^{-7}$.
  • Figure 5: Validation of the hypothesis that maximum partial return by $r_{\hat{A^*_r}}$ across all loops determines the direction of performance differences between $greedy~\hat{A^*_r}$ and $greedy~Q^*_{r_{\hat{A^*_r}}}$. 1080 runs are shown, built from the set of 90 MDPs $\times ~\{10, 100, 1000\}$ preferences in the training set $\times ~\{1,2\}$ segment lengths $\times ~\{ \text{noiselessly},\text{stochastically}\}$ generated preferences. Plot points are colored orange when every $\pi^*_r$ terminates and blue when every $\pi^*_r$ does not terminate. The blue and orange shading of the plot represents where our hypotheses predict circles of each color to be, if $y\neq0$. Returns are standardized across MDPs within $[-1,1]$ (detail in Appendix \ref{app:experimental_settings}), and the x-axis is the maximum partial return by $r_{\hat{A^*_r}}$ across all loops in the MDP. Of the 75 runs with a performance difference ($y\neq0$), 73 conform to our hypothesis. In the remaining 2 runs, both algorithms achieve near-optimal behavior and therefore have a difference of less than 0.1.
  • ...and 6 more figures
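
The Figure 2 comparison can be made concrete with a small calculation. The sketch below plugs the caption's numbers (partial returns of $-2$ for both segments, regrets of $4$ and $0$) into a logistic preference model. The unit-temperature logistic form and the function names `p_prefer_partial_return` and `p_prefer_regret` are assumptions made for illustration; they are not taken from the paper's equations.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def p_prefer_partial_return(return_1: float, return_2: float) -> float:
    """P(segment 1 preferred) when preferences depend on partial return.

    Assumed form: logistic of the difference in summed segment reward.
    """
    return logistic(return_1 - return_2)


def p_prefer_regret(regret_1: float, regret_2: float) -> float:
    """P(segment 1 preferred) when preferences depend on regret.

    Assumed form: logistic of the difference in negated regret,
    so the segment with lower regret is favored.
    """
    return logistic(regret_2 - regret_1)


# Numbers from the Figure 2 caption.
left_return, right_return = -2.0, -2.0
left_regret, right_regret = 4.0, 0.0

p_right_partial = p_prefer_partial_return(right_return, left_return)
p_right_regret = p_prefer_regret(right_regret, left_regret)

print(f"P(right preferred | partial return model) = {p_right_partial:.3f}")
print(f"P(right preferred | regret model)         = {p_right_regret:.3f}")
```

Under these assumptions the partial return model is indifferent between the two segments, while the regret model strongly favors the optimal right segment, matching the caption's qualitative claim.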

Theorems & Definitions (3)

  • Theorem 3.1: Greedy action is optimal when the maximum reward in every state is 0.
  • Corollary 3.1: Policy invariance of $r_{A^*_r}$
  • Corollary 3.2
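
Theorem 3.1 and Corollary 3.1 can be checked numerically on a toy problem. The sketch below builds a hypothetical four-state chain MDP (not one of the paper's gridworlds), computes $A^*_r$ exactly by value iteration, then treats $A^*_r$ as if it were a reward and solves again. The MDP, the discount factor `GAMMA = 0.9`, and all function names are assumptions introduced for illustration.

```python
GAMMA = 0.9          # assumed discount factor
N_STATES = 4         # states 0..3; state 3 is an absorbing goal state
ACTIONS = (0, 1)     # 0 = move left, 1 = move right


def step(s, a):
    """Deterministic transition of the hypothetical chain MDP."""
    if s == N_STATES - 1:
        return s                      # absorbing state loops to itself
    return max(s - 1, 0) if a == 0 else s + 1


def true_reward(s, a):
    """-1 per step until the absorbing state, where reward is 0."""
    return 0.0 if s == N_STATES - 1 else -1.0


def solve_q(reward, iters=500):
    """Value iteration; returns Q* as a dict keyed by (state, action)."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(iters):
        v = {s: max(q[(s, a)] for a in ACTIONS) for s in range(N_STATES)}
        q = {(s, a): reward(s, a) + GAMMA * v[step(s, a)]
             for s in range(N_STATES) for a in ACTIONS}
    return q


def greedy_actions(f, tol=1e-9):
    """Set of maximizing actions in each state for a table f[(s, a)]."""
    out = {}
    for s in range(N_STATES):
        best = max(f[(s, a)] for a in ACTIONS)
        out[s] = {a for a in ACTIONS if f[(s, a)] >= best - tol}
    return out


# Optimal advantage under the true reward: A*(s, a) = Q*(s, a) - V*(s).
q_star = solve_q(true_reward)
v_star = {s: max(q_star[(s, a)] for a in ACTIONS) for s in range(N_STATES)}
a_star = {(s, a): q_star[(s, a)] - v_star[s]
          for s in range(N_STATES) for a in ACTIONS}

# Treat A* as a reward function and solve the resulting MDP.
q_under_a = solve_q(lambda s, a: a_star[(s, a)])

print("greedy on A*:            ", greedy_actions(a_star))
print("greedy on Q* under r:    ", greedy_actions(q_star))
print("greedy on Q* under r=A*: ", greedy_actions(q_under_a))
```

Because $\max_a A^*_r(s,a) = 0$ in every state, solving with $A^*_r$ as the reward returns $Q^* = A^*_r$ in this example, so the extra planning step changes nothing and greedy maximization of $A^*_r$ already yields the optimal actions. Note that the example includes the transitions from the absorbing state, whose advantage is 0.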