Crowd-PrefRL: Preference-Based Reward Learning from Crowds

David Chhan, Ellen Novoseller, Vernon J. Lawhern

TL;DR

The paper tackles learning reward functions for RL from crowdsourced pairwise preferences when annotator reliability varies. It introduces Crowd-PrefRL, which aggregates crowd feedback with the Spectral Meta-Learner (SML) and uses the aggregated labels for on-policy reward learning via PrefPPO, enabling robust performance and detection of minority viewpoints. Empirical results in DMControl environments show that SML-based aggregation often outperforms majority voting and individual users, particularly under high crowd heterogeneity, and can identify distinct user groups who hold divergent objectives. The approach offers practical benefits for scalable, crowd-aware RL and sets the stage for personalized or multi-objective reward balancing based on unsupervised crowd analysis.
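
As a rough illustration of the aggregation step described above, the sketch below contrasts majority voting with an SML-style weighted vote on simulated ±1 preference labels. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the simulated error rates, the balanced ground-truth labels, and the simple diagonal refit loop are all choices made for this example. The idea it illustrates is that, when users err independently, the off-diagonal entries of the users' label covariance matrix are approximately rank one, so its leading eigenvector can rank users by reliability without access to ground truth.

```python
import random


def simulate_crowd(error_rates, n_queries, seed=0):
    """Simulated users: each flips the true +/-1 preference with its own error rate."""
    rng = random.Random(seed)
    truth = [1 if rng.random() < 0.5 else -1 for _ in range(n_queries)]
    labels = [[-y if rng.random() < e else y for y in truth] for e in error_rates]
    return truth, labels


def sample_covariance(labels):
    """Covariance between users' label vectors (users x users)."""
    n_queries = len(labels[0])
    centered = []
    for row in labels:
        mean = sum(row) / n_queries
        centered.append([x - mean for x in row])
    return [[sum(a * b for a, b in zip(ri, rj)) / (n_queries - 1)
             for rj in centered] for ri in centered]


def leading_eigenpair(matrix, iters=300):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    n = len(matrix)
    v = [1.0 / n ** 0.5] * n
    for _ in range(iters):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    return sum(a * b for a, b in zip(v, mv)), v  # Rayleigh quotient, eigenvector


def sml_weights(labels, refits=20):
    """SML-style user weights from the (approximately) rank-one covariance structure."""
    r = sample_covariance(labels)
    for _ in range(refits):
        eigval, v = leading_eigenpair(r)
        for i in range(len(r)):
            # Heuristic assumed for this sketch: refit the diagonal to eigval * v_i^2.
            r[i][i] = eigval * v[i] ** 2
    _, v = leading_eigenpair(r)
    # Resolve the sign ambiguity by assuming most users beat random guessing.
    return v if sum(v) >= 0 else [-x for x in v]


def weighted_vote(labels, weights):
    """Aggregate label per query: sign of the weighted sum of user labels."""
    n_queries = len(labels[0])
    return [1 if sum(w * row[k] for w, row in zip(weights, labels)) >= 0 else -1
            for k in range(n_queries)]


def error_rate(pred, truth):
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)


# Hypothetical crowd: one reliable user and six users close to chance.
error_rates = [0.05, 0.40, 0.40, 0.45, 0.45, 0.45, 0.45]
truth, labels = simulate_crowd(error_rates, n_queries=3000)
weights = sml_weights(labels)
maj_err = error_rate(weighted_vote(labels, [1.0] * len(labels)), truth)  # uniform weights
sml_err = error_rate(weighted_vote(labels, weights), truth)
print(f"MAJ error: {maj_err:.3f}  SML error: {sml_err:.3f}")
print("SML weights:", [round(w, 2) for w in weights])
```

In this toy crowd the user error rates are widely spread, which is the regime where the summary reports the largest gains for SML over majority voting. With a homogeneous crowd the two aggregates would be expected to behave similarly.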

Abstract

Preference-based reinforcement learning (RL) provides a framework to train AI agents using human feedback through preferences over pairs of behaviors, enabling agents to learn desired behaviors when it is difficult to specify a numerical reward function. While this paradigm leverages human feedback, it typically treats the feedback as given by a single human user. However, different users may desire multiple AI behaviors and modes of interaction. Meanwhile, incorporating preference feedback from crowds (i.e. ensembles of users) in a robust manner remains a challenge, and the problem of training RL agents using feedback from multiple human users remains understudied. In this work, we introduce a conceptual framework, Crowd-PrefRL, that integrates preference-based RL approaches with techniques from unsupervised crowdsourcing to enable training of autonomous system behaviors from crowdsourced feedback. We show preliminary results suggesting that Crowd-PrefRL can learn reward functions and agent policies from preference feedback provided by crowds of unknown expertise and reliability. We also show that in most cases, agents trained with Crowd-PrefRL outperform agents trained with majority-vote preferences or preferences from any individual user, especially when the spread of user error rates among the crowd is large. Results further suggest that our method can identify the presence of minority viewpoints within the crowd in an unsupervised manner.
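
For context, reward learning from pairwise preferences is commonly formulated with a Bradley-Terry style preference predictor. The notation below is an assumption introduced for illustration (behavior segments $\sigma^0, \sigma^1$, learned reward $\hat{r}_\psi$, label $y \in \{0, 1\}$); the text here does not spell out the exact loss used.

$$
P_\psi[\sigma^1 \succ \sigma^0] = \frac{\exp \sum_t \hat{r}_\psi(s_t^1, a_t^1)}{\exp \sum_t \hat{r}_\psi(s_t^0, a_t^0) + \exp \sum_t \hat{r}_\psi(s_t^1, a_t^1)}
$$

$$
\mathcal{L}(\psi) = -\mathbb{E}_{(\sigma^0, \sigma^1, y)} \Big[ (1 - y) \log P_\psi[\sigma^0 \succ \sigma^1] + y \log P_\psi[\sigma^1 \succ \sigma^0] \Big]
$$

In a crowd setting, $y$ would be an aggregated crowd label rather than a single user's answer.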

Paper Structure

This paper contains 11 sections, 4 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Crowd-PrefRL framework for training RL agents via crowd feedback. We assume a crowd, or ensemble, of users is queried for their preferences over pairs of segmented behaviors (A vs. B). These crowd preferences are then used to learn aggregate preference labels and reward function(s) for preference-based RL. Note that crowds might consist of users who provide diverse preference feedback according to different objectives (blue vs. orange), which we assume are unknown a priori.
  • Figure 2: Difference between majority-vote (MAJ) and Spectral Meta-Learner (SML) preference prediction error rates for different levels of variability (standard deviation) in user error rates. Values are calculated across 100 randomly sampled crowds of 7, 11 and 15 simulated users for the Walker-walk environment. The horizontal dashed line at $y=0$ indicates where the MAJ and SML error rates are the same; points above $y=0$ indicate that the SML labels have lower error than the MAJ labels. Red dots indicate where SML outperforms the best crowd member; blue dots indicate where the best crowd member outperforms SML.
  • Figure 3: (Top row) Comparison of Crowd-PrefPPO training curves with SML, MAJ and Oracle labeling across two different crowd configurations: [(a), left] a crowd for which SML is expected to outperform MAJ, and [(b), right] a crowd for which SML is expected to perform similarly to MAJ across the three different environments (Walker-walk, Quadruped-walk and Cheetah-run). (Bottom row) Comparison of MAJ and SML label prediction errors at each feedback iteration. Each plot shows the mean $\pm$ standard error of 6 out of 10 runs (the top and bottom 2 runs are omitted to reduce the effect of outliers, as detailed in the Experiment Setup).
  • Figure 4: SML weights and agent performance in MO-Hopper. (a) 25K training steps. Left: Scatter plot of crowd SML weights $\hat{v}_i$ with corresponding user error rates. Right: scatter plot of returns of 100 rollout trajectories with two reward functions, (1) Forward and (2) Vertical Height. (b) Similar to (a) with 100K training steps. At 25K training steps (a), we observe a reliable separation between the minority and majority groups and weakly negatively correlated returns in the two reward functions, indicating that the agent has not yet learned to optimize both objectives. In contrast, at 100K training steps (b), the agent has learned to satisfy both sets of users, as indicated by the large positive correlation in returns, and thus the SML weights no longer separate into two groups.
  • Figure 5: Unsupervised clustering of users based purely on crowd feedback. We consider a crowd of 150 workers with a 110/40 majority/minority split in the MO-Hopper environment. (a) For 25K agent training steps, we see (left) a histogram of SML weights $\hat{v}_i$ overlaid with a fitted Gaussian mixture model (GMM), (middle) the Bayesian information criterion (BIC) values used to automatically infer the number of clusters, and (right) a scatter plot of SML weights vs. ground-truth error showing the accuracy of the inferred grouping. Colors indicate the ground-truth grouping (blue for majority, orange for minority). (b) Similar to (a) with 100K training steps. We see that for the agent at 25K training steps (a), a GMM with two Gaussian components reliably models the two groups using only the SML weights $\hat{v}_i$, automatically inferring the number of clusters and performing unsupervised cluster assignment. At 100K training steps (b), the agent can satisfy both sets of users, and so the cluster analysis no longer infers any minority.
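
The cluster analysis described in the Figure 5 caption (fit Gaussian mixtures to the SML weights, then pick the number of components by BIC) can be sketched as follows. This is a minimal sketch under stated assumptions: the weights are synthetic stand-ins with hypothetical values, only the 110/40 split is taken from the caption, and the small expectation-maximization routine is a standard-library substitute for whatever GMM implementation the authors used.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_gmm_1d(xs, k, iters=200, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture; returns (mix, means, variances, log_likelihood)."""
    n = len(xs)
    ordered = sorted(xs)
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]  # quantile init
    grand_mean = sum(xs) / n
    spread = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    variances = [spread] * k
    mix = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for numerical stability.
        resp = []
        for x in xs:
            logs = [math.log(mix[j]) + log_normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = logsumexp(logs)
            resp.append([math.exp(v - total) for v in logs])
        # M-step: update mixing weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            mix[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor)
    loglik = sum(
        logsumexp([math.log(mix[j]) + log_normal_pdf(x, means[j], variances[j])
                   for j in range(k)])
        for x in xs)
    return mix, means, variances, loglik


def bic(loglik, k, n):
    n_params = 3 * k - 1  # k means, k variances, k - 1 free mixing weights
    return -2.0 * loglik + n_params * math.log(n)


def assign(xs, mix, means, variances):
    """Hard cluster assignment: most probable component for each weight."""
    return [max(range(len(mix)),
                key=lambda j: math.log(mix[j]) + log_normal_pdf(x, means[j], variances[j]))
            for x in xs]


# Synthetic stand-in for SML weights: 110 majority users, then 40 minority users.
rng = random.Random(0)
crowd_weights = ([rng.gauss(0.09, 0.015) for _ in range(110)]
                 + [rng.gauss(-0.06, 0.015) for _ in range(40)])

fits = {k: fit_gmm_1d(crowd_weights, k) for k in (1, 2, 3)}
bics = {k: bic(fits[k][3], k, len(crowd_weights)) for k in fits}
best_k = min(bics, key=bics.get)  # lower BIC is better
mix, means, variances, _ = fits[2]
clusters = assign(crowd_weights, mix, means, variances)
print("BIC by number of components:", {k: round(v, 1) for k, v in bics.items()})
print("selected number of components:", best_k)
```

When the weights form a single group, as the caption reports for the agent at 100K training steps, the one-component model would be expected to have the lowest BIC, which corresponds to inferring no minority.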