Listwise Reward Estimation for Offline Preference-based Reinforcement Learning

Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon

TL;DR

This work tackles the challenge of reward specification in RL by learning rewards from human preferences in an offline setting. It introduces LiRE, which constructs a Ranked List of Trajectories (RLT) using only existing ternary preferences to unlock second-order information about relative preferences. By deriving a richer set of training signals from the RLT and employing a linear score function, LiRE achieves substantial improvements over strong offline PbRL baselines on a newly designed Meta-World/DMControl dataset, with robustness to feedback quantity and noise. The approach offers a practical path to more fine-grained alignment of RL agents with human intent, without requiring expensive new feedback modalities.

Abstract

In Reinforcement Learning (RL), designing precise reward functions remains a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. However, existing PbRL methods are limited in that they often overlook second-order preference, which indicates the relative strength of preference. In this paper, we propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL that leverages second-order preference information by constructing a Ranked List of Trajectories (RLT), which can be built efficiently using the same ternary feedback type as traditional methods. To validate the effectiveness of LiRE, we propose a new offline PbRL dataset that objectively reflects the effect of the estimated rewards. Our extensive experiments on the dataset demonstrate the superiority of LiRE: it outperforms state-of-the-art baselines even with modest feedback budgets and is robust to both the amount of feedback and feedback noise. Our code is available at https://github.com/chwoong/LiRE
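
The TL;DR mentions a linear score function, and the Figure 2 caption contrasts "exp" with "linear". The sketch below shows one plausible reading, which is an assumption here rather than something confirmed by the text on this page: a Bradley-Terry style preference probability in which the score applied to each segment's summed reward is either `exp(x)` or `x`. The names `preference_prob` and `cross_entropy` are hypothetical and are not taken from the LiRE repository.

```python
import math


def preference_prob(ret_a: float, ret_b: float, score: str = "linear") -> float:
    """Probability that segment A is preferred over segment B.

    ret_a and ret_b are the summed predicted rewards of each segment.
    Assumption: P(A > B) = phi(ret_a) / (phi(ret_a) + phi(ret_b)),
    with phi(x) = exp(x) or phi(x) = x.
    """
    if score == "exp":
        # Equivalent to a sigmoid of the return difference.
        return 1.0 / (1.0 + math.exp(ret_b - ret_a))
    if score == "linear":
        # The linear score only defines a probability for positive returns.
        if ret_a <= 0 or ret_b <= 0:
            raise ValueError("linear score needs positive returns")
        return ret_a / (ret_a + ret_b)
    raise ValueError(f"unknown score function: {score}")


def cross_entropy(prob_a: float, label: float) -> float:
    """Loss for one pair; label is 1.0 (A preferred), 0.0 (B preferred) or 0.5 (tie)."""
    eps = 1e-12  # avoids log(0)
    return -(label * math.log(prob_a + eps) + (1.0 - label) * math.log(1.0 - prob_a + eps))
```

Under this reading, the linear score depends on the ratio of the two returns while the exponential score depends on their difference: `preference_prob(2, 1)` and `preference_prob(20, 10)` both give 2/3, whereas the `"exp"` variant gives a much more extreme probability for the second pair.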

Paper Structure

This paper contains 42 sections, 8 equations, 10 figures, 18 tables, 1 algorithm.

Figures (10)

  • Figure 1: An overview of LiRE, illustrated on the button-press-topdown task. We sample a trajectory segment and sequentially obtain preference feedback comparing it against trajectories already in the RLT, using binary search to find its correct rank efficiently (left). Multiple preference pairs are then generated from the RLT to train the reward model (right).
  • Figure 2: Scatter plots of the estimated rewards for the segments used in the box-close task. The reward models are trained with MR or LiRE using the exp or linear score function. The Pearson correlation coefficient, $r$, is reported.
  • Figure 3: Average success rate of each method as the amount of preference feedback varies. The black dotted line represents the average success rate when training with the ground-truth (GT) reward.
  • Figure 4: Robustness of LiRE w.r.t. feedback noise.
  • Figure 5: Effect of the granularity of preference feedback.
  • ...and 5 more figures
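
The procedure described in the Figure 1 caption (binary-search insertion into the RLT, then generating many preference pairs from it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the RLT is represented as a list of groups ordered from least to most preferred, each group holds segments judged equally preferable, and a new segment is compared against the first member of a group. The names `insert_into_rlt`, `pairs_from_rlt`, and the `feedback` callable are hypothetical.

```python
from typing import Callable, List, Tuple, TypeVar

Seg = TypeVar("Seg")
# Ternary feedback (assumed encoding): +1 if the first segment is preferred,
# -1 if the second is preferred, 0 if they are equally preferable.
Feedback = Callable[[Seg, Seg], int]


def insert_into_rlt(rlt: List[List[Seg]], segment: Seg, feedback: Feedback) -> int:
    """Insert `segment` into `rlt` in place and return the number of queries used."""
    lo, hi = 0, len(rlt)
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        answer = feedback(segment, rlt[mid][0])  # compare with a group representative
        queries += 1
        if answer == 0:
            rlt[mid].append(segment)  # tie: join the existing group
            return queries
        if answer > 0:
            lo = mid + 1  # segment ranks above this group
        else:
            hi = mid  # segment ranks below this group
    rlt.insert(lo, [segment])  # no tie found: create a new group at rank lo
    return queries


def pairs_from_rlt(rlt: List[List[Seg]]) -> List[Tuple[Seg, Seg, float]]:
    """Return (a, b, label) triples; label 1.0 means b is preferred, 0.5 means a tie."""
    pairs: List[Tuple[Seg, Seg, float]] = []
    for i, group in enumerate(rlt):
        for a in range(len(group)):
            for b in range(a + 1, len(group)):
                pairs.append((group[a], group[b], 0.5))  # same group: tie
        for higher in rlt[i + 1:]:
            for worse in group:
                for better in higher:
                    pairs.append((worse, better, 1.0))  # higher group wins
    return pairs
```

For example, calling `insert_into_rlt` repeatedly with a noiseless `feedback` oracle builds a fully ordered list using roughly a logarithmic number of queries per segment, and `pairs_from_rlt` then yields one training pair for every pair of segments in the list, which is far more than the number of queries actually asked.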