Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Katherine Metcalf; Miguel Sarabia; Natalie Mackraz; Barry-John Theobald

TL;DR

The paper tackles the difficulty of learning reward functions from human preferences in reinforcement learning and proposes Rewards Encoding Environment Dynamics (REED), a dynamics-aware reward function that improves sample efficiency. REED uses a self-supervised temporal-consistency objective (SPR) to learn a dynamics-aware state-action representation $z^{sa}$, which is then used to bootstrap a reward function $\hat{r}_{\psi}$ via a linear predictor. Empirically, REED improves sample efficiency across locomotion and manipulation tasks, matching baselines that use an order of magnitude more preference labels (e.g., 50 vs. 500 labels) and remaining robust under label noise. The results show that explicitly encoding environment dynamics in the reward model can accelerate preference-based reinforcement learning (PbRL) and improve robustness, with gains across state-space and image-space observations and across labelling strategies.
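To make the "linear predictor" step concrete, the following is a minimal sketch of how a reward read off $z^{sa}$ could be trained from binary preferences. It assumes the Bradley-Terry preference model that is common in PbRL; the summary above does not spell out the loss, so treat that choice as an assumption. The function names (`linear_reward`, `segment_return`, `preference_prob`, `preference_loss`) and the toy vectors are hypothetical and are not taken from the REED codebase.

```python
import math


def linear_reward(z_sa, weights, bias=0.0):
    """Hypothetical linear head: r_hat = w . z_sa + b for one state-action embedding."""
    return sum(w * z for w, z in zip(weights, z_sa)) + bias


def segment_return(segment, weights, bias=0.0):
    """Sum of predicted rewards over a trajectory segment (a list of z_sa vectors)."""
    return sum(linear_reward(z_sa, weights, bias) for z_sa in segment)


def preference_prob(seg_a, seg_b, weights, bias=0.0):
    """P(seg_a preferred over seg_b) under an assumed Bradley-Terry model.

    Equivalent to sigmoid(return_a - return_b), computed in a numerically stable way.
    """
    diff = segment_return(seg_a, weights, bias) - segment_return(seg_b, weights, bias)
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


def preference_loss(seg_a, seg_b, label, weights, bias=0.0):
    """Binary cross-entropy against the preference label.

    label = 1.0 if seg_a is preferred, 0.0 if seg_b is preferred, 0.5 if equally preferred.
    """
    p = preference_prob(seg_a, seg_b, weights, bias)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# Toy usage: two segments of two-dimensional embeddings (made-up numbers).
weights = [1.0, -0.5]
seg_a = [[1.0, 0.0], [0.5, 0.0]]
seg_b = [[0.0, 1.0], [0.0, 0.5]]
loss_when_a_preferred = preference_loss(seg_a, seg_b, 1.0, weights)
```

In this sketch only `weights` and `bias` would be fit to the preference labels, while the representation producing each `z_sa` is learned separately by the self-supervised objective.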

Abstract

Preference-based reinforcement learning (PbRL) aligns a robot's behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation $z^{sa}$ via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from $z^{sa}$, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground truth reward policy performance versus only 38% and 21%. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed.
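Step (1) can be illustrated with a small sketch of a temporal-consistency loss. It assumes the negative cosine-similarity form used by self-predictive representations (SPR), where a predicted next-state embedding is compared with an embedding of the observed next state; the abstract does not state the exact loss, so this is an assumption. The function names and toy vectors are hypothetical, and a real implementation would operate on batched tensors from learned encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small constant avoids division by zero


def temporal_consistency_loss(predicted_next, target_next):
    """Mean negative cosine similarity over a batch of (prediction, target) pairs.

    predicted_next: embeddings predicted from z_sa for the next state (hypothetical).
    target_next: embeddings of the observed next state from a target encoder (hypothetical).
    The loss is lowest (about -1) when every prediction points in the target's direction.
    """
    sims = [cosine_similarity(p, t) for p, t in zip(predicted_next, target_next)]
    return -sum(sims) / len(sims)


# Toy usage with made-up embeddings.
aligned_loss = temporal_consistency_loss([[1.0, 0.0]], [[2.0, 0.0]])
orthogonal_loss = temporal_consistency_loss([[1.0, 0.0]], [[0.0, 3.0]])
```

Minimizing a loss of this kind pushes $z^{sa}$ to carry enough information to predict what happens next, which is the sense in which the representation is dynamics-aware.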

Paper Structure (34 sections, 5 equations, 18 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Preference-based Reinforcement Learning
Dynamics-Aware Reward Function
Rewards Encoding Environment Dynamics (REED)
Incorporating REED into PbRL
Experimental Setup
Results
Source of Improvements
Discussion and Limitations
Conclusion
Disagreement Sampling
Labelling Strategies
REED Algorithm
Architectures
...and 19 more sections

Figures (18)

Figure 1: Architecture for the self-predictive representation (SPR) objective (Schwarzer et al., 2020) (in yellow), and the preference-learned reward function (in blue). Modules in green are shared between SPR and the preference-learned reward function.
Figure 2: Learning curves for three DMC and two MetaWorld tasks with 50 and 500 (DMC) and 2.5k and 10k (MetaWorld) pieces of feedback, for state-space observations, disagreement sampling, and oracle labels. Refer to the appendices on state-space and image-space learning curves for more tasks and feedback amounts.
Figure 3: Mean normalized return across oracle, noisy, mistake, and equal labellers (Lee et al., 2021) on quadruped-walk with state-space observations for 50, 500, and 1000 pieces of feedback.
Figure 4: Learning curves for three DMC and two MetaWorld tasks with 50 and 500 (DMC) and 2.5k and 10k (MetaWorld) pieces of feedback, for image-space observations, disagreement sampling, and oracle labels. Only PEBBLE is evaluated for the image-space due to the poor state-space performance of PrefPPO. Results for more tasks and feedback amounts are available in the appendices on state-space and image-space learning curves.
Figure 5: Ablation of the SAF reward net for walker-walk, quadruped-walk, sweep into, and button press with 500 (walker and quadruped) and 5k (sweep into and button press) teacher-labelled queries with disagreement-based sampling and the oracle labelling strategy.
...and 13 more figures
