B-Pref: Benchmarking Preference-Based Reinforcement Learning

Kimin Lee, Laura Smith, Anca Dragan, Pieter Abbeel

TL;DR

The paper tackles the lack of standardized benchmarks for preference-based reinforcement learning by introducing B-Pref, a package of locomotion and manipulation tasks paired with simulated teachers exhibiting a range of irrationalities. It formalizes evaluation metrics that assess both agent performance and robustness to teacher quirks, and benchmarks two leading methods, PrefPPO and PEBBLE, to reveal how algorithmic design choices like informative-query sampling affect outcomes. The findings show that while existing methods perform well with idealized teachers, their robustness degrades under more realistic, imperfect feedback, underscoring the need for more resilient approaches. By providing open-source code and a diverse suite of scenarios, B-Pref aims to standardize evaluation and stimulate systematic advances in preference-based RL.

Abstract

Reinforcement learning (RL) requires access to a reward function that incentivizes the right behavior, but these are notoriously hard to specify for complex tasks. Preference-based RL provides an alternative: learning policies using a teacher's preferences without pre-defined rewards, thus overcoming concerns associated with reward engineering. However, it is difficult to quantify the progress in preference-based RL due to the lack of a commonly adopted benchmark. In this paper, we introduce B-Pref: a benchmark specially designed for preference-based RL. A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly, which makes relying on real human input for evaluation prohibitive. At the same time, simulating human input as giving perfect preferences for the ground truth reward function is unrealistic. B-Pref alleviates this by simulating teachers with a wide array of irrationalities, and proposes metrics not solely for performance but also for robustness to these potential irrationalities. We showcase the utility of B-Pref by using it to analyze algorithmic design choices, such as selecting informative queries, for state-of-the-art preference-based RL algorithms. We hope that B-Pref can serve as a common starting point to study preference-based RL more systematically. Source code is available at https://github.com/rll-research/B-Pref.
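As a rough illustration of what "simulating teachers with a wide array of irrationalities" could look like, the sketch below labels a pair of behavior segments from their ground-truth rewards. It is not taken from the B-Pref code base: the function name `simulated_preference`, the parameters `beta` and `equal_threshold`, and the logistic choice rule are all assumptions made for this example. With `beta=None` the teacher is a perfectly rational oracle; a finite `beta` makes it stochastic, and `equal_threshold` lets it declare near-ties equally preferable, returning the uniform label (0.5, 0.5).

```python
import numpy as np


def simulated_preference(rewards_0, rewards_1, beta=None,
                         equal_threshold=0.0, rng=None):
    """Return a preference label (y0, y1) over two behavior segments.

    Hypothetical sketch, not the B-Pref API. `rewards_0` and `rewards_1`
    are per-step ground-truth rewards of the two segments.
    """
    rng = np.random.default_rng() if rng is None else rng
    return_0 = float(np.sum(rewards_0))
    return_1 = float(np.sum(rewards_1))

    # Assumed "equal" behavior: segments whose returns are close enough
    # receive the uniform label.
    if abs(return_0 - return_1) <= equal_threshold:
        return (0.5, 0.5)

    # Oracle teacher: always prefers the segment with the higher return.
    if beta is None:
        return (1.0, 0.0) if return_0 > return_1 else (0.0, 1.0)

    # Stochastic teacher (assumed logistic choice rule): the larger beta,
    # the closer the behavior is to the oracle.
    prob_1 = 1.0 / (1.0 + np.exp(-beta * (return_1 - return_0)))
    return (0.0, 1.0) if rng.random() < prob_1 else (1.0, 0.0)
```

Calling `simulated_preference([1, 1], [0, 0])` yields the oracle label for the first segment, while passing `equal_threshold=5.0` turns the same query into an equally-preferable one. Lowering `beta` toward zero makes the labels increasingly random, which is one simple way to probe an algorithm's robustness to noisy feedback.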

Paper Structure

This paper contains 22 sections, 11 equations, 19 figures, 2 tables, 3 algorithms.

Figures (19)

  • Figure 1: Illustration of preference-based RL. Instead of assuming that the environment provides a (hand-engineered) reward, a teacher provides preferences between the agent's behaviors, and the agent uses this feedback in order to learn the desired behavior.
  • Figure 2: IQM normalized returns with 95% confidence intervals across ten runs. Learning curves and other metrics (median, mean, optimality gap) are in the supplementary material.
  • Figure 3: IQM normalized returns of PEBBLE with various sampling schemes across ten runs on Quadruped. Learning curves and other metrics (median, mean, optimality gap) are in the supplementary material.
  • Figure 4: Learning curves of PEBBLE with different feedback schedules on the oracle teacher. The solid line and shaded regions represent the mean and standard deviation, respectively, across ten runs.
  • Figure 5: (a) Fraction of equally preferable queries (red) and average returns (blue) on the Equal teacher. We use PEBBLE with different sampling schemes on Quadruped given a budget of 2000 queries. Even though a teacher provides more uniform labels, i.e., $y=(0.5, 0.5)$, to uncertainty-based sampling schemes, they achieve higher returns than other sampling schemes. (b/c) Time series of learned reward function (green) and the ground truth reward (red) using rollouts from a policy optimized by PEBBLE. Learned reward functions align with the ground truth rewards in (b) Sweep Into and (c) Walker.
  • ...and 14 more figures
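Figures 2 and 3 report IQM normalized returns. Assuming IQM stands for the interquartile mean (the mean of the middle 50% of scores, which discards the lowest and highest quarters), a minimal sketch of the aggregate follows. The helper name `interquartile_mean` and the rounding-down rule for how many scores to trim are assumptions made for this example, not the authors' evaluation code.

```python
import numpy as np


def interquartile_mean(scores):
    """Mean of the middle 50% of `scores` (hypothetical helper).

    Trims int(0.25 * n) values from each end of the sorted scores,
    so with ten runs the two lowest and two highest are dropped.
    """
    sorted_scores = np.sort(np.asarray(scores, dtype=float).ravel())
    n_cut = int(0.25 * sorted_scores.size)
    middle = sorted_scores[n_cut:sorted_scores.size - n_cut]
    return float(middle.mean())
```

Because the extremes are trimmed, a single failed or unusually lucky run among ten changes the aggregate far less than it would change a plain mean, while still using more of the data than the median does.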