Reinforcement Learning from Diverse Human Preferences

Wanqi Xue, Bo An, Shuicheng Yan, Zhongwen Xu

TL;DR

This paper tackles the challenge of learning from human preferences when feedback is diverse and potentially inconsistent. It introduces a latent-space framework in which reward signals are implicitly manipulated via an encoder–decoder pair, coupled with a fixed prior and a strong KL constraint to stabilize learning from noisy preferences. A confidence-based ensembling mechanism then aggregates multiple reward models to improve stability and reliability. Across DMControl and Meta-world tasks, the method restores near-oracle performance under diverse annotators, demonstrating scalability and practical potential for real-world RL with crowd-sourced feedback. Overall, the approach provides a robust pathway to scalable, preference-driven RL by stabilizing reward learning and leveraging ensemble confidence.

Abstract

The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMControl and Meta-world and shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
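
The latent-space regularization described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Bradley-Terry preference model over summed segment rewards, a diagonal Gaussian encoder output, a standard normal prior, and a weight `phi` on the KL term (the symbol $\phi$ is taken from the Figure 5 caption; the function names are hypothetical).

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def preference_prob(rewards_a, rewards_b):
    """Bradley-Terry probability that segment a is preferred over segment b."""
    diff = np.sum(rewards_a) - np.sum(rewards_b)
    return 1.0 / (1.0 + np.exp(-diff))

def regularized_preference_loss(rewards_a, rewards_b, label, mu, log_var, phi):
    """Cross-entropy on one preference label plus a phi-weighted KL penalty.

    label is 1.0 if the annotator preferred segment a, 0.0 if segment b.
    mu and log_var are the (assumed) encoder outputs, shape (n_inputs, latent_dim).
    """
    eps = 1e-8  # avoids log(0)
    p = preference_prob(rewards_a, rewards_b)
    cross_entropy = -(label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps))
    kl = np.mean(kl_to_standard_normal(mu, log_var))
    return cross_entropy + phi * kl
```

Under these assumptions, a larger `phi` pulls the encoder's latent distribution toward the prior, which is the "strong constraint" the abstract refers to; `phi = 0` recovers an unregularized preference loss.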

Paper Structure (12 sections, 10 equations, 7 figures, 1 algorithm)

Figures (7)

  • Figure 1: Illustration of our method. (a) There is a team of different annotators with bounded rationality to provide their preferences. Based on the preference data, a reward model is learned and used to provide rewards to an RL agent for policy optimization. (b) The reward models encode an input into a latent space where a strong distribution constraint is applied to address inconsistency issues. Following that, a novel reward model ensembling method is applied to the decoders to aggregate their predictions.
  • Figure 2: Examples for locomotion and robotic manipulation tasks.
  • Figure 3: Learning curves on locomotion tasks (first row) and robotic manipulation tasks (second row). The locomotion tasks are measured on the ground truth episode return while the robotic manipulation tasks are measured on the success rate. The solid line and shaded regions represent the mean and standard deviation, respectively, across five runs.
  • Figure 4: Ablation study on the strength of the latent space constraint. The locomotion tasks (first row) are measured on the ground truth episode return while the robotic manipulation (second row) tasks are measured on the success rate. The results show the mean and standard deviation averaged over five runs.
  • Figure 5: Analysis of the influence of $\phi$ on the reward model. First row: increasing the strength of the constraint narrows the value range of the predicted rewards. The reward model also generates more distinct predictions when $\phi$ is large. Second row: t-SNE visualization of the latent vectors. A large $\phi$ leads to a more compact pattern.
  • ...and 2 more figures
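
The Figure 1(b) caption mentions aggregating decoder predictions with a confidence-based ensemble. The exact aggregation rule is not given in this summary, so the sketch below uses inverse-variance weighting as a stand-in. It assumes each ensemble member reports a reward mean and a standard deviation; the function name and the weighting rule are both assumptions, not the paper's method.

```python
import numpy as np

def confidence_weighted_reward(means, stds, eps=1e-8):
    """Combine per-member reward predictions, trusting low-variance members more.

    means, stds: one predicted reward mean and standard deviation per ensemble
    member (assumed outputs of the decoders). Returns (reward, weights).
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    confidence = 1.0 / (stds**2 + eps)         # assumed confidence: inverse variance
    weights = confidence / confidence.sum()    # normalize so weights sum to 1
    return float(np.dot(weights, means)), weights
```

With equal standard deviations this reduces to a plain average of the members; a member that reports a much smaller standard deviation dominates the combined reward.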

Theorems & Definitions (1)

  • Remark 1