Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards
Katherine Metcalf, Miguel Sarabia, Natalie Mackraz, Barry-John Theobald
TL;DR
The paper tackles the difficulty of learning reward functions in reinforcement learning when relying on human preferences, proposing dynamics-aware rewards (REED) to boost sample efficiency. REED integrates a self-supervised temporal-consistency objective (SPR) to learn a dynamics-aware state-action representation $z^{sa}$, which is then used to bootstrap a reward function $\hat{r}_{\psi}$ via a linear predictor. Empirically, REED yields substantial improvements in sample efficiency across locomotion and manipulation tasks, achieving parity with baselines that use an order of magnitude more preference labels (e.g., 50 vs 500 labels) and showing robust performance under label noise. The contributions demonstrate that explicitly encoding environment dynamics in the reward model can markedly accelerate PbRL and enhance robustness, with gains observed across state-space and image-space observations and across diverse labeling strategies.
Abstract
Preference-based reinforcement learning (PbRL) aligns a robot's behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation $z^{sa}$ via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from $z^{sa}$, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83\% and 66\% of ground truth reward policy performance versus only 38\% and 21\%. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: \texttt{https://github.com/apple/ml-reed}.
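
The two alternating objectives described above can be illustrated with a minimal sketch. It is not taken from the ml-reed repository. It assumes that the temporal-consistency term is a negative cosine similarity between a predicted and a target next-step embedding (the usual SPR form), that the reward is a linear function of $z^{sa}$, and that binary preferences are modeled with a Bradley-Terry likelihood over summed segment rewards, a common choice in PbRL that the text above does not state explicitly. All function and parameter names are hypothetical.

```python
import math


def cosine_consistency_loss(pred_next, target_next):
    """Assumed SPR-style loss: negative cosine similarity between the
    predicted and the target embedding of the next step."""
    dot = sum(p * t for p, t in zip(pred_next, target_next))
    norm_p = math.sqrt(sum(p * p for p in pred_next))
    norm_t = math.sqrt(sum(t * t for t in target_next))
    return -dot / (norm_p * norm_t + 1e-8)


def linear_reward(z_sa, weights, bias=0.0):
    """Hypothetical reward head: a linear predictor on one z^{sa} vector."""
    return sum(w * z for w, z in zip(weights, z_sa)) + bias


def preference_loss(segment_a, segment_b, label, weights, bias=0.0):
    """Assumed Bradley-Terry cross-entropy for one labeled pair.

    segment_a, segment_b: lists of z^{sa} vectors, one per time step.
    label: 1.0 if segment_a is preferred, 0.0 if segment_b is preferred.
    """
    ret_a = sum(linear_reward(z, weights, bias) for z in segment_a)
    ret_b = sum(linear_reward(z, weights, bias) for z in segment_b)
    diff = ret_a - ret_b
    # Numerically stable log(sigmoid(diff)) = log P(segment_a preferred).
    if diff >= 0:
        log_p_a = -math.log1p(math.exp(-diff))
    else:
        log_p_a = diff - math.log1p(math.exp(diff))
    # log(1 - sigmoid(diff)) = log(sigmoid(diff)) - diff.
    log_p_b = log_p_a - diff
    return -(label * log_p_a + (1.0 - label) * log_p_b)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, chosen only for illustration.
    seg_a = [[1.0, 0.0], [1.0, 0.0]]
    seg_b = [[0.0, 1.0], [0.0, 1.0]]
    w = [1.0, -1.0]
    print("preference loss:", preference_loss(seg_a, seg_b, 1.0, w))
    print("consistency loss:", cosine_consistency_loss([1.0, 0.0], [1.0, 0.0]))
```

In a full training loop the two losses would be minimized in alternation, with the encoder producing $z^{sa}$ updated by the consistency loss and the reward head updated by the preference loss. How gradients are shared between the two steps is a design detail this sketch leaves open.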
