Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning

Zhengran Ji, Boyuan Chen

TL;DR

This work tackles online reinforcement learning with real-time human feedback that is often noisy and temporally inconsistent. It introduces Pref-GUIDE, which converts scalar feedback into temporally localized pairwise preferences (Pref-GUIDE Individual) and aggregates reward models across evaluators via voting (Pref-GUIDE Voting). The approach yields more robust reward learning for continual policy training, outperforming scalar-based baselines and sometimes surpassing expert-designed dense rewards in complex tasks. Ablation studies show the moving window and no-preference margin are essential, and population voting provides robustness to evaluator differences. Overall, Pref-GUIDE offers a scalable, principled way to leverage human input for online RL and sustain learning after supervision ends.

Abstract

Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning.
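The conversion from scalar feedback to windowed preferences described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `scalar_to_preferences`, the `window` and `margin` values, and the choice to label ambiguous pairs as 0.5 (rather than drop them) are all assumptions made for illustration.

```python
def scalar_to_preferences(feedback, window=3, margin=0.1):
    """Convert time-ordered scalar feedback into local pairwise preferences.

    Returns (i, j, label) tuples with i < j, where label is 1.0 if step i is
    preferred, 0.0 if step j is preferred, and 0.5 for "no preference".
    `window` and `margin` are illustrative values, not the paper's settings.
    """
    pairs = []
    n = len(feedback)
    for i in range(n):
        # Only compare step i with later steps inside the same short window.
        for j in range(i + 1, min(i + window, n)):
            diff = feedback[i] - feedback[j]
            if abs(diff) <= margin:
                label = 0.5  # difference too small to trust: no preference
            elif diff > 0:
                label = 1.0  # step i received clearly higher feedback
            else:
                label = 0.0  # step j received clearly higher feedback
            pairs.append((i, j, label))
    return pairs


# Hypothetical feedback values for four consecutive time steps.
feedback = [0.0, 0.5, 0.55, -1.0]
for pair in scalar_to_preferences(feedback, window=2):
    print(pair)
```

With `window=2` only adjacent steps are compared, so distant feedback values never meet in a pair, and the nearly equal values at steps 1 and 2 fall inside the margin and receive the no-preference label.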

Paper Structure

This paper contains 23 sections, 3 equations, 12 figures, 1 table, 2 algorithms.

Figures (12)

  • Figure 1: Pref-GUIDE. Real-time scalar human feedback is often inconsistent and noisy, and it varies across individuals. Pref-GUIDE addresses this by (a) converting scalar feedback into local pairwise preferences to achieve temporal consistency, and (b) aggregating reward models across human evaluators to form consensus-based rewards. (c) These improvements yield more robust reward learning and enable effective continual policy training after human feedback becomes unavailable.
  • Figure 2: Method Overview. (a) Pref-GUIDE Individual converts real-time scalar feedback from each human evaluator into a localized preference dataset, then trains evaluator-specific reward models. (b) Pref-GUIDE Voting aggregates predictions from these individual models to relabel trajectory pairs through consensus voting, providing a population-informed preference dataset and a robust reward model. (c) The aggregated reward model is used to guide RL training during the post-human-guidance phase.
  • Figure 3: Human feedback is temporally inconsistent for similar trajectories. We visualize t-SNE embeddings of trajectory representations from evaluator 0 in all three tasks. Each point represents a trajectory, color-coded by its corresponding scalar human feedback. Although many trajectories are behaviorally similar (i.e., embedded closely), their feedback values vary widely. This observation highlights the temporal drift and inconsistency in human evaluations, motivating our approach of converting scalar feedback into temporally local preference pairs.
  • Figure 4: Results. Each column shows a different task. The top row reports results using only the top 15 evaluators (high-quality feedback), while the bottom row includes all 50 evaluators (mixed feedback). Curves after the dashed vertical line show performance during the post-human-guidance phase. Pref-GUIDE Individual outperforms GUIDE when the feedback quality is high, while Pref-GUIDE Voting gives the best results across all conditions and even surpasses expert-designed rewards in more complex tasks.
  • Figure 5: Pref-GUIDE Voting enhances robustness across different evaluators. Each subplot shows the percentage of evaluators (y-axis) whose agents reached specific performance milestones (x-axis) at different time points (title of each plot). Each row shows one task. Pref-GUIDE Voting consistently enables a larger fraction of evaluators to train high-performing agents across milestones and time.
  • ...and 7 more figures
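
The consensus relabeling mentioned in the Figure 2 caption can be sketched as a simple majority vote. This is only a sketch under stated assumptions: the captions do not specify the voting rule, so the majority rule, the tie handling, the function name `vote_preference`, and the toy reward models below are all hypothetical.

```python
def vote_preference(reward_models, traj_a, traj_b):
    """Relabel one trajectory pair by majority vote over reward models.

    Each model is a callable mapping a trajectory to a scalar reward.
    Returns 1.0 if most models prefer traj_a, 0.0 if most prefer traj_b,
    and 0.5 on a tie. The majority rule itself is an assumption.
    """
    votes_a = 0
    votes_b = 0
    for model in reward_models:
        reward_a = model(traj_a)
        reward_b = model(traj_b)
        if reward_a > reward_b:
            votes_a += 1
        elif reward_b > reward_a:
            votes_b += 1
        # Equal rewards count as an abstention.
    if votes_a > votes_b:
        return 1.0
    if votes_b > votes_a:
        return 0.0
    return 0.5


# Toy stand-ins for evaluator-specific reward models (hypothetical).
models = [
    lambda traj: sum(traj),   # evaluator who rewards total progress
    lambda traj: max(traj),   # evaluator who rewards the best moment
    lambda traj: -sum(traj),  # evaluator whose feedback is inverted
]
print(vote_preference(models, [1, 2], [0, 1]))
```

In this toy case two of the three models prefer the first trajectory, so the single inverted evaluator is outvoted, which is the kind of robustness to evaluator differences that the voting variant aims for.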