Reinforcement Learning from Diverse Human Preferences
Wanqi Xue, Bo An, Shuicheng Yan, Zhongwen Xu
TL;DR
This paper tackles the challenge of learning rewards from human preferences when feedback comes from diverse and potentially inconsistent annotators. It introduces a latent-space framework in which rewards are regularized and corrected through an encoder–decoder pair, with a strong KL constraint toward a fixed prior to stabilize learning from noisy preference labels. A confidence-based ensembling mechanism then aggregates multiple reward models to produce more stable and reliable predictions. On DMControl and Meta-world tasks, the method shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, pointing toward practical RL with crowd-sourced preference labels.
Abstract
The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMControl and Meta-world and shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
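To make the ingredients named in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a Bradley-Terry style preference loss over summed segment rewards, a standard normal prior with a diagonal Gaussian latent, and an inverse-standard-deviation weighting for the ensemble; none of these specifics are stated in the abstract. All function names and the `beta` weight are hypothetical.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def preference_loss(r0_sum, r1_sum, label):
    # Assumed Bradley-Terry model: P(segment 1 preferred) = sigmoid(r1_sum - r0_sum).
    # label = 1.0 means the annotator preferred segment 1, 0.0 means segment 0.
    d = r1_sum - r0_sum
    return label * softplus(-d) + (1.0 - label) * softplus(d)

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def regularized_loss(r0_sum, r1_sum, label, mu, log_var, beta=1.0):
    # Preference loss plus a KL penalty pulling the latent toward the prior.
    # A large beta corresponds to a "strong" constraint (hypothetical knob).
    return preference_loss(r0_sum, r1_sum, label) + beta * kl_to_standard_normal(mu, log_var)

def confidence_weighted_reward(means, stds):
    # Assumed confidence rule: weight each ensemble member by 1 / std,
    # so members that are more certain contribute more to the prediction.
    inv = [1.0 / s for s in stds]
    total = sum(inv)
    return sum(m * w / total for m, w in zip(means, inv))
```

With equal standard deviations the ensemble reduces to a plain average; a member reporting a larger standard deviation is down-weighted. Raising `beta` trades fitting individual (possibly conflicting) labels for a latent that stays close to the prior.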
