Rating-based Reinforcement Learning
Devin White, Mingkang Wu, Ellen Novoseller, Vernon J. Lawhern, Nicholas Waytowich, Yongcan Cao
TL;DR
This work tackles the challenge of learning rewards in reinforcement learning without explicit reward functions by introducing rating-based RL (RbRL), which uses absolute human ratings on individual trajectories. It formulates a reward model, a normalized trajectory return, and a novel multi-class cross-entropy loss that leverages probabilistic rating predictions tied to rating-category boundaries. Through synthetic and real human experiments, RbRL often outperforms preference-based RL (PbRL) and reduces human labeling effort, while revealing practical trade-offs in the number of rating classes and boundary estimation. The approach offers a scalable, human-in-the-loop alternative for reward learning with potential for fast global guidance and smoother exploration in complex environments.
Abstract
This paper develops a novel rating-based reinforcement learning approach that uses human ratings to obtain human guidance in reinforcement learning. Unlike existing preference-based and ranking-based reinforcement learning paradigms, which rely on relative human preferences over sample pairs, the proposed rating-based approach relies on human evaluation of individual trajectories, without relative comparisons between sample pairs. The approach builds on a new prediction model for human ratings and a novel multi-class loss function. We conduct several experimental studies with synthetic ratings and real human ratings to evaluate the effectiveness and benefits of the new approach.
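To make the idea of a rating prediction model with a multi-class loss concrete, the following is a minimal sketch, not the exact formulation of the paper. It assumes that trajectory returns are min-max normalized to [0, 1] within a batch, that rating class i is bounded by consecutive boundaries b[i] and b[i+1], and that the class probability is a softmax over a score that is largest when the normalized return lies between those two boundaries. The function names, the sharpness parameter `k`, and the boundary values are all hypothetical and chosen for illustration only.

```python
import numpy as np

def normalize_returns(returns):
    """Min-max scale a batch of predicted trajectory returns into [0, 1]."""
    returns = np.asarray(returns, dtype=float)
    span = returns.max() - returns.min()
    if span == 0.0:
        # Degenerate batch: all returns equal, so place them mid-range.
        return np.full_like(returns, 0.5)
    return (returns - returns.min()) / span

def rating_probabilities(norm_return, boundaries, k=10.0):
    """Probability of each rating class for one normalized return.

    Assumed form: the score of class i is -k * (r - b[i]) * (r - b[i+1]),
    which is positive only when r lies between the two boundaries of class i.
    """
    b = np.asarray(boundaries, dtype=float)
    scores = -k * (norm_return - b[:-1]) * (norm_return - b[1:])
    scores = scores - scores.max()  # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def rating_cross_entropy(norm_returns, labels, boundaries, k=10.0):
    """Mean multi-class cross-entropy between predicted and human ratings."""
    losses = [
        -np.log(rating_probabilities(r, boundaries, k)[y])
        for r, y in zip(norm_returns, labels)
    ]
    return float(np.mean(losses))

# Hypothetical example: three rating classes with hand-picked boundaries.
boundaries = [0.0, 0.4, 0.7, 1.0]
norm = normalize_returns([1.0, 3.0, 5.0])
loss = rating_cross_entropy(norm, labels=[0, 1, 2], boundaries=boundaries)
```

In a full pipeline, the returns would come from a learned reward model and the loss would be minimized with respect to that model's parameters; here the returns are fixed numbers so that the rating model can be inspected in isolation.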
