B-Pref: Benchmarking Preference-Based Reinforcement Learning
Kimin Lee, Laura Smith, Anca Dragan, Pieter Abbeel
TL;DR
The paper tackles the lack of standardized benchmarks for preference-based reinforcement learning by introducing B-Pref, a package of locomotion and manipulation tasks paired with simulated teachers exhibiting a range of irrationalities. It formalizes evaluation metrics that assess both agent performance and robustness to those irrationalities, and benchmarks two leading methods, PrefPPO and PEBBLE, to reveal how algorithmic design choices like informative-query sampling affect outcomes. The findings show that while existing methods perform well with idealized teachers, their robustness degrades under more realistic, imperfect feedback, underscoring the need for more resilient approaches. By providing open-source code and a diverse suite of scenarios, B-Pref aims to standardize evaluation and stimulate systematic advances in preference-based RL.
Abstract
Reinforcement learning (RL) requires access to a reward function that incentivizes the right behavior, but such a function is notoriously hard to specify for complex tasks. Preference-based RL provides an alternative: learning policies using a teacher's preferences without pre-defined rewards, thus overcoming concerns associated with reward engineering. However, it is difficult to quantify the progress in preference-based RL due to the lack of a commonly adopted benchmark. In this paper, we introduce B-Pref: a benchmark specially designed for preference-based RL. A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly, which makes relying on real human input for evaluation prohibitive. At the same time, simulating human input as giving perfect preferences for the ground-truth reward function is unrealistic. B-Pref alleviates this by simulating teachers with a wide array of irrationalities, and proposes metrics not solely for performance but also for robustness to these potential irrationalities. We showcase the utility of B-Pref by using it to analyze algorithmic design choices, such as selecting informative queries, for state-of-the-art preference-based RL algorithms. We hope that B-Pref can serve as a common starting point to study preference-based RL more systematically. Source code is available at https://github.com/rll-research/B-Pref.
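
To make the idea of a simulated teacher with irrationalities concrete, the following is a minimal sketch and not the B-Pref implementation. It assumes a teacher that compares two trajectory segments by their summed ground-truth reward, chooses stochastically through a logistic model, and occasionally flips its answer. The function name simulated_preference and the parameters beta (rationality) and error_rate (mistake probability) are hypothetical names chosen for illustration.

```python
import math
import random


def simulated_preference(rewards_a, rewards_b, beta=1.0, error_rate=0.0, rng=None):
    """Return 0 if segment A is preferred, 1 if segment B is preferred.

    Illustrative assumptions (not taken from the B-Pref code):
      - each segment is scored by the sum of its ground-truth rewards;
      - beta = math.inf gives a perfectly rational teacher, smaller beta
        gives noisier choices;
      - error_rate is the probability of flipping the final answer.
    """
    if rng is None:
        rng = random.Random()

    return_a = sum(rewards_a)
    return_b = sum(rewards_b)

    if math.isinf(beta):
        # Perfectly rational teacher: always pick the higher return (ties go to A).
        preferred = 0 if return_a >= return_b else 1
    else:
        # Logistic choice model: P(B preferred) = sigmoid(beta * (return_b - return_a)).
        # The exponent is clamped to avoid overflow in math.exp.
        z = max(-60.0, min(60.0, beta * (return_b - return_a)))
        p_b = 1.0 / (1.0 + math.exp(-z))
        preferred = 1 if rng.random() < p_b else 0

    # Mistake model: flip the label with probability error_rate.
    if rng.random() < error_rate:
        preferred = 1 - preferred
    return preferred


# Example: a rational teacher versus one that always makes a mistake.
rng = random.Random(0)
rational_label = simulated_preference([1.0, 2.0], [0.5, 0.5], beta=math.inf, rng=rng)
mistaken_label = simulated_preference(
    [1.0, 2.0], [0.5, 0.5], beta=math.inf, error_rate=1.0, rng=rng
)
```

Sweeping beta and error_rate over a grid, and comparing an agent's final return under each setting with its return under the perfectly rational teacher, is one simple way to probe the kind of robustness the abstract describes.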
