Hindsight PRIORs for Reward Learning from Human Preferences

Mudit Verma, Katherine Metcalf

TL;DR

This paper introduces Hindsight PRIOR, a credit assignment strategy that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance through an auxiliary predicted return redistribution objective.

Abstract

Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by learning a reward from preference feedback over policy behaviors. Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior most contributed to a preference, which results in data-intensive approaches and subpar reward functions. We address such limitations by introducing a credit assignment strategy (Hindsight PRIOR) that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance through an auxiliary predicted return redistribution objective. Incorporating state importance into reward learning improves the speed of policy learning, overall policy performance, and reward recovery on both locomotion and manipulation tasks. For example, Hindsight PRIOR recovers on average significantly (p<0.05) more reward on MetaWorld (20%) and DMC (15%). The performance gains and our ablations demonstrate the benefits that even a simple credit assignment strategy can have on reward learning, and that state importance in forward dynamics prediction is a strong proxy for a state's contribution to a preference decision. The code repository is available at https://github.com/apple/ml-rlhf-hindsight-prior.
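The return redistribution idea described in the abstract can be illustrated with a minimal sketch. It assumes the per-state importance scores are already available (how the world model produces them is not specified here), and it uses a mean squared error between predicted rewards and importance-weighted shares of the predicted return. The function names and the choice of squared error are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def normalize_importance(scores: Sequence[float]) -> List[float]:
    """Rescale non-negative importance scores so they sum to 1."""
    total = sum(scores)
    if total <= 0:
        # Assumed fallback: uniform importance when no state stands out.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def redistribution_targets(
    rewards: Sequence[float], importance: Sequence[float]
) -> List[float]:
    """Split the predicted return across states in proportion to importance."""
    predicted_return = sum(rewards)  # return of the segment under the reward model
    weights = normalize_importance(importance)
    return [w * predicted_return for w in weights]


def redistribution_loss(
    rewards: Sequence[float], importance: Sequence[float]
) -> float:
    """Mean squared gap between each predicted reward and its target share."""
    targets = redistribution_targets(rewards, importance)
    return sum((r - t) ** 2 for r, t in zip(rewards, targets)) / len(rewards)


if __name__ == "__main__":
    rewards = [2.0, 1.0, 1.0]      # hypothetical per-state predicted rewards
    importance = [1.0, 1.0, 2.0]   # hypothetical world-model importance scores
    print(redistribution_targets(rewards, importance))
    print(redistribution_loss(rewards, importance))
```

In this sketch the loss is zero exactly when each predicted reward already equals its importance-weighted share of the return, which is one way to read "rewards proportional to state importance".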

Paper Structure

This paper contains 28 sections, 9 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Hindsight PRIOR augments the existing PbRL cross-entropy loss by encouraging the magnitude of a reward to be proportional to the state's importance. At each reward update, preference-labelled trajectories are passed to a world model $\hat{\mathcal{T}}$ (yellow) and the estimated reward function $\hat{r}_\psi$ (red), which assign an importance score and a reward (respectively) to each state-action pair. The predicted return $\hat{G}_\psi$ is then applied to the importance scores, which in turn serve as auxiliary targets for reward learning.
  • Figure 2: PbRL and SAC policy learning curves for six MetaWorld (top and middle rows) and three DMC (bottom row) tasks. Each experiment is specified as: task / feedback amount.
  • Figure 3: PbRL learning curves over different labelling mistake amounts (left and center: purple and pink for PEBBLE, red and magenta for PRIOR), and different methods for return distribution and dynamics-aware rewards (right).
  • Figure 4: Learning curves evaluating different trajectory lengths (left), combining Hindsight PRIOR with SURF (center), and removing the influence of preference feedback (right).
  • Figure 5: Learning curves of PRIOR, BISIM, RVAR, and the baseline PEBBLE.
  • ...and 3 more figures
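
The Figure 1 caption describes augmenting the PbRL cross-entropy loss with an auxiliary term. A minimal sketch of how such a combination might look follows. It assumes the Bradley-Terry preference model over summed predicted rewards that is commonly used in PbRL, and a weighting coefficient `aux_weight` that is hypothetical; neither the exact loss form nor the weighting scheme is given in this section.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(
    rewards_a: Sequence[float], rewards_b: Sequence[float]
) -> float:
    """Probability that segment A is preferred over B (Bradley-Terry, assumed)."""
    return sigmoid(sum(rewards_a) - sum(rewards_b))


def preference_cross_entropy(
    rewards_a: Sequence[float], rewards_b: Sequence[float], label: float
) -> float:
    """Cross-entropy against a label in [0, 1], where 1 means A was preferred."""
    eps = 1e-12  # guards against log(0)
    p = min(max(preference_probability(rewards_a, rewards_b), eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def total_loss(pref_loss: float, aux_loss: float, aux_weight: float) -> float:
    """Preference loss plus a weighted auxiliary term (aux_weight is hypothetical)."""
    return pref_loss + aux_weight * aux_loss


if __name__ == "__main__":
    seg_a = [0.5, 1.5, 1.0]  # hypothetical predicted rewards for segment A
    seg_b = [0.2, 0.3, 0.5]  # hypothetical predicted rewards for segment B
    ce = preference_cross_entropy(seg_a, seg_b, label=1.0)
    print(total_loss(ce, aux_loss=0.25, aux_weight=0.1))
```

Here `aux_loss` stands in for whatever auxiliary redistribution term is computed from the importance scores; it is passed in as a plain number so the sketch stays independent of how that term is defined.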