Rating-based Reinforcement Learning

Devin White, Mingkang Wu, Ellen Novoseller, Vernon J. Lawhern, Nicholas Waytowich, Yongcan Cao

TL;DR

This work tackles the challenge of learning rewards in reinforcement learning without explicit reward functions by introducing rating-based RL (RbRL), which uses absolute human ratings on individual trajectories. It formulates a reward model, a normalized trajectory return, and a novel multi-class cross-entropy loss that leverages probabilistic rating predictions tied to rating-category boundaries. Through synthetic and real human experiments, RbRL often outperforms preference-based RL (PbRL) and reduces human labeling effort, while revealing practical trade-offs in the number of rating classes and boundary estimation. The approach offers a scalable, human-in-the-loop alternative for reward learning with potential for fast global guidance and smoother exploration in complex environments.

Abstract

This paper develops a novel rating-based reinforcement learning approach that uses human ratings to obtain human guidance in reinforcement learning. Unlike existing preference-based and ranking-based reinforcement learning paradigms, which rely on humans' relative preferences over sample pairs, the proposed approach relies on human evaluation of individual trajectories, with no relative comparison between sample pairs. The approach builds on a new prediction model for human ratings and a novel multi-class loss function. We conduct several experimental studies with synthetic ratings and real human ratings to evaluate the effectiveness and benefits of the new approach.
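To make the ingredients named in the TL;DR concrete (a normalized trajectory return, rating-category boundaries, and a multi-class cross-entropy loss), here is a minimal sketch in plain Python. The summary above does not give the formulas, so everything specific here is an assumption chosen for illustration: min-max normalization over the batch, the class score -k (R - b_i)(R - b_{i+1}), the sharpness constant k, and the function names normalize_returns, rating_probs, and rating_cross_entropy. It should be read as one plausible instance of the idea, not as the authors' implementation.

```python
import math


def normalize_returns(returns):
    """Min-max normalize a batch of predicted trajectory returns to [0, 1]."""
    lo, hi = min(returns), max(returns)
    if hi == lo:
        # Assumption: a degenerate batch maps every return to the midpoint.
        return [0.5 for _ in returns]
    return [(r - lo) / (hi - lo) for r in returns]


def rating_probs(norm_return, boundaries, k=10.0):
    """Probability of each rating class for one normalized return.

    boundaries is an increasing list [b_0, ..., b_n] with b_0 = 0 and b_n = 1.
    Class i covers [b_i, b_{i+1}]. Its score -k*(R - b_i)*(R - b_{i+1}) is
    positive when R lies inside that interval and negative outside it
    (illustrative choice, not confirmed by the summary above).
    """
    scores = [
        -k * (norm_return - b_lo) * (norm_return - b_hi)
        for b_lo, b_hi in zip(boundaries[:-1], boundaries[1:])
    ]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rating_cross_entropy(returns, ratings, boundaries, k=10.0):
    """Mean multi-class cross-entropy between predicted and human rating classes."""
    normalized = normalize_returns(returns)
    loss = 0.0
    for r, label in zip(normalized, ratings):
        loss -= math.log(rating_probs(r, boundaries, k)[label])
    return loss / len(returns)


# Three rating classes with evenly spaced (hypothetical) boundaries.
boundaries = [0.0, 1 / 3, 2 / 3, 1.0]
returns = [-4.0, 1.0, 6.0]  # made-up predicted returns for three trajectories

# Ratings that agree with the return ordering should give a lower loss
# than ratings that contradict it.
good = rating_cross_entropy(returns, [0, 1, 2], boundaries)
bad = rating_cross_entropy(returns, [2, 1, 0], boundaries)
print(f"consistent ratings: {good:.3f}, reversed ratings: {bad:.3f}")
```

In a full pipeline the returns would come from a learned reward model summed over each trajectory, and the loss would be minimized with respect to that model's parameters. The sketch only evaluates the loss for fixed returns.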

Paper Structure (23 sections, 4 equations, 8 figures)

Figures (8)

  • Figure 1: Performance of RbRL in synthetic experiments for different $n$, compared to PbRL: mean reward $\pm$ standard error over 10 runs for Walker (top) and Quadruped (bottom).
  • Figure 2: RbRL performance for different $n$ in a human experiment: performance in the Cheetah environment (mean $\pm$ standard error over 3 experiment runs).
  • Figure 3: Performance of RbRL and PbRL in the human user study: Cheetah (top) and Swimmer (bottom). For non-expert users, the plots show mean $\pm$ standard error over 7 users. The expert results are each over a single experiment run.
  • Figure 4: RbRL and PbRL performance for the top 3 (non-expert) user study participants: mean reward $\pm$ standard error over the 3 experiment runs each for Cheetah (top) and Swimmer (bottom).
  • Figure 5: Participants' responses to survey questions about RbRL and PbRL. The set of survey questions is detailed in the Appendix. The blue bar indicates the median and the edges depict the 1st quartile (left) and 3rd quartile (right).
  • ...and 3 more figures