Human-Inspired Multi-Level Reinforcement Learning
Mingkang Wu, Devin White, Vernon Lawhern, Nicholas R. Waytowich, Yongcan Cao
TL;DR
The paper tackles reward specification challenges in reinforcement learning by introducing a human-inspired, two-level approach that leverages rating-based signals for low-level reward estimation and a KL-divergence-based mechanism for high-level policy direction. The method preserves the existing RbRL framework while adding a modular, distribution-based policy loss that pushes the agent away from lower-rated experiences in a graded manner. Empirical results on six DeepMind Control Suite tasks show that the proposed RbRL-KL approach generally improves performance over standard RbRL, especially as the number of rating classes increases. This work highlights how human ratings can simultaneously shape both reward signals and policy trajectories, offering a path toward more data-efficient, human-in-the-loop reinforcement learning in reward-free or sparsely rewarded settings.
Abstract
Reinforcement learning (RL), a common tool in decision making, learns control policies from experiences based on their associated cumulative returns/rewards, without otherwise treating those experiences differently. Humans, in contrast, often learn to distinguish among discrete levels of performance and to extract underlying insights (beyond reward signals) that help optimize their decisions. For instance, when learning to play tennis, a human player does not treat all unsuccessful attempts equally. Missing the ball completely signals a more severe mistake than hitting it out of bounds, although the cumulative rewards can be similar in both cases. Learning effectively from multi-level experiences is essential in human decision making. This motivates us to develop a novel multi-level RL method that learns from multi-level experiences by extracting multi-level information. At the low level of information extraction, we use the existing rating-based reinforcement learning (RbRL) to infer inherent reward signals that reflect the value of states or state-action pairs. At the high level of information extraction, we propose to extract directional information from experiences at different levels so that the policy can be updated to deviate from each level of experience to the desired degree. Specifically, we propose a new policy loss function that penalizes distribution similarities between the current policy and experiences at different levels, and assigns different weights to the penalty terms based on the performance levels. Furthermore, integrating the two levels into multi-level RL guides the agent toward updates that improve both the inferred reward and the policy, hence yielding a learning mechanism similar to that of humans.
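The high-level loss described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: discrete action distributions stand in for the policy and for each rating class, the KL direction is KL(class || policy), the weights decrease linearly from the lowest rating class to the highest, and the names `kl_divergence`, `rating_weights`, `multilevel_policy_loss`, and `beta` are hypothetical.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions (clipped and renormalized)."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def rating_weights(num_classes):
    """Hypothetical weighting: class 0 (lowest rating) gets the largest
    weight and the top class gets zero, so the push away is graded."""
    if num_classes < 2:
        raise ValueError("need at least two rating classes")
    raw = (num_classes - 1 - np.arange(num_classes)).astype(float)
    return raw / raw.sum()


def multilevel_policy_loss(base_loss, policy_dist, class_dists, beta=0.1):
    """Base policy loss minus a weighted sum of KL terms.

    Subtracting the KL terms rewards dissimilarity, i.e. it penalizes
    similarity between the policy and lower-rated experience distributions.
    `class_dists` is ordered from lowest to highest rating class.
    """
    weights = rating_weights(len(class_dists))
    penalty = sum(
        w * kl_divergence(d, policy_dist)
        for w, d in zip(weights, class_dists)
    )
    return base_loss - beta * penalty


# Toy example with three rating classes over three actions.
bad, mid, good = [0.8, 0.1, 0.1], [0.4, 0.3, 0.3], [0.1, 0.1, 0.8]
near_bad = multilevel_policy_loss(1.0, [0.7, 0.2, 0.1], [bad, mid, good])
far_from_bad = multilevel_policy_loss(1.0, [0.1, 0.2, 0.7], [bad, mid, good])
```

In this toy setup, a policy that resembles the lowest-rated experiences receives a higher loss than one that stays away from them, which is the graded "push away" behavior the abstract describes. In a real agent the base loss would come from the underlying RL algorithm trained on the RbRL-inferred rewards, and the distributions would be estimated from rated trajectory segments.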
