Active teacher selection for reinforcement learning from human feedback

Rachel Freedman, Justin Svegliato, Kyle Wray, Stuart Russell

TL;DR

The paper addresses a limitation of RLHF systems, the assumption that all feedback comes from a single teacher, by introducing the Hidden Utility Bandit (HUB) framework, which models multiple teachers who differ in rationality and cost in order to learn a shared utility function. It proposes Active Teacher Selection (ATS), which constructs a HUB-POMDP and solves it with POMCPW to decide when to query and which teacher to ask, maximizing the discounted sum of utilities while acquiring informative feedback efficiently. The authors give theoretical results on the convergence of naive inference and on query complexity, and show empirically that ATS outperforms baselines on a paper recommendation task and a COVID-19 vaccine testing scenario, which highlights the value of exploiting teacher heterogeneity for robust reward modeling. The work is relevant to scalable, safe, and reliable reward learning in systems that must integrate diverse human feedback, and it lays groundwork for future integration with hierarchical planning and state abstraction to scale to complex domains.

Abstract

Reinforcement learning from human feedback (RLHF) enables machine learning systems to learn objectives from human feedback. A core limitation of these systems is their assumption that all feedback comes from a single human teacher, despite querying a range of distinct teachers. We propose the Hidden Utility Bandit (HUB) framework to model differences in teacher rationality, expertise, and costliness, formalizing the problem of learning from multiple teachers. We develop a variety of solution algorithms and apply them to two real-world domains: paper recommendation systems and COVID-19 vaccine testing. We find that the Active Teacher Selection (ATS) algorithm outperforms baseline algorithms by actively selecting when and which teacher to query. The HUB framework and ATS algorithm demonstrate the importance of leveraging differences between teachers to learn accurate reward models, facilitating future research on active teacher selection for robust reward modeling.
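To make "teacher rationality" concrete, the following is a minimal sketch of one common way to model a noisy teacher: a Boltzmann-rational (logistic) preference model. The abstract does not spell out the model, so the functional form, the function names `preference_probability` and `query_teacher`, and the example utilities are all assumptions made for illustration.

```python
import math
import random


def preference_probability(u_i, u_j, beta):
    """Probability that a teacher reports item i as better than item j.

    Assumed Boltzmann-rational model (not taken from the source): a logistic
    function of the utility gap, scaled by the rationality parameter beta.
    beta = 0 gives a coin flip; large beta gives a near-perfect teacher.
    """
    return 1.0 / (1.0 + math.exp(-beta * (u_i - u_j)))


def query_teacher(u_i, u_j, beta, rng):
    """Sample one noisy comparison; True means the teacher prefers item i."""
    return rng.random() < preference_probability(u_i, u_j, beta)


if __name__ == "__main__":
    rng = random.Random(0)
    u_a, u_b = 1.0, 0.0  # illustrative utilities, not from the paper
    for beta in (0.0, 1.0, 10.0):
        answers = [query_teacher(u_a, u_b, beta, rng) for _ in range(1000)]
        share = sum(answers) / len(answers)
        print(f"beta={beta}: model p={preference_probability(u_a, u_b, beta):.3f}, "
              f"empirical share={share:.3f}")
```

Under this assumed model, a more rational teacher gives more reliable answers about which item is better, while a less rational teacher's error rate carries information about how large the utility gap is. This is one reason why choosing which teacher to query can matter.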

Paper Structure (33 sections, 9 theorems, 10 equations, 10 figures)

This paper contains 33 sections, 9 theorems, 10 equations, 10 figures.

Key Result

Theorem 3.2

If the predicted utility function $\hat{\mathcal{U}}$ and the predicted arm distribution $\hat{\mathcal{D}^{\mathcal{C}}}$ are estimated by executing the naive algorithm (Algorithm alg:naive) with $T$ samples, then $\hat{\mathcal{U}}\rightarrow\mathcal{U}^*$ and $\hat{\mathcal{D}^{\mathcal{C}}}\rightarrow\mathcal{D}^{\dots}$ ...
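The intuition behind a convergence result of this kind can be illustrated with plain frequency estimates. The sketch below is not the paper's naive algorithm, which is not shown here. It assumes a logistic (Boltzmann-rational) teacher with known rationality `beta`, estimates an arm's item distribution from observed item counts, and recovers a single utility gap by inverting the logistic model. All names and numeric values are hypothetical.

```python
import math
import random
from collections import Counter


def estimate_arm_distribution(observed_items):
    """Empirical item frequencies from repeated pulls of one arm."""
    counts = Counter(observed_items)
    total = len(observed_items)
    return {item: n / total for item, n in counts.items()}


def estimate_utility_gap(prefers_first, beta):
    """Estimate U(first) - U(second) from repeated comparisons.

    Assumes a logistic teacher with known rationality beta, so the gap is
    logit(p_hat) / beta. p_hat is clipped to keep the logit finite.
    """
    p_hat = sum(prefers_first) / len(prefers_first)
    p_hat = min(max(p_hat, 1e-6), 1.0 - 1e-6)
    return math.log(p_hat / (1.0 - p_hat)) / beta


def simulate(num_samples, seed=0):
    """Draw num_samples pulls and comparisons from a hypothetical ground truth."""
    rng = random.Random(seed)
    true_dist = {"apple": 0.7, "banana": 0.3}  # hypothetical arm distribution
    true_gap, beta = 1.0, 1.0                  # hypothetical utility gap and rationality
    items = rng.choices(list(true_dist), weights=list(true_dist.values()),
                        k=num_samples)
    p_first = 1.0 / (1.0 + math.exp(-beta * true_gap))
    answers = [rng.random() < p_first for _ in range(num_samples)]
    return estimate_arm_distribution(items), estimate_utility_gap(answers, beta)


if __name__ == "__main__":
    for num_samples in (100, 1000, 20000):
        dist, gap = simulate(num_samples)
        print(num_samples, round(dist.get("apple", 0.0), 3), round(gap, 3))
```

As the number of samples grows, both estimates concentrate around the assumed ground truth, which is the behaviour a convergence statement in $T$ describes.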

Figures (10)

  • Figure 1: A simple Hidden Utility Bandit (HUB) with two arms and two teachers. The agent pulls the first arm, observes an apple, and receives the apple's utility of $8$ without observing it. The agent then pulls the second arm, observes a banana, and receives the banana's utility of $2$ without observing it. Because these utilities are hidden, the agent forgoes the opportunity for utility on the third timestep to ask the expert teacher which fruit is better. The expert replies that apples are better than bananas, so the agent pulls the first arm to maximize apples for all remaining timesteps.
  • Figure 2: Paper recommendation as a HUB problem. Paper categories (Application, Benchmark, Theory) are items ($\mathcal{I}$), professors are teachers with rationality ($\beta$) and cost ($F$) parameters, conferences are arms with distributions ($\mathcal{D}$), and relevance scores are utilities ($\mathcal{U}$). The goal is to recommend the most relevant conferences to read papers from.
  • Figure 3: Comparison of ATS, naive and random algorithms. ATS best maximizes discounted reward (a) and identifies the highest-reward arm more often than most baselines and comparably with Naive[100] and Naive[200], which explore more and earn less reward (b). ATS initially queries teachers less often than naive baselines, but continues querying teachers throughout the episode (c). All data is averaged across 25 runs on 20 HUB problems and smoothed over 10 steps.
  • Figure 4: Accuracy of reward learning using ATS (with specific and general teacher selection) and naive algorithms (with exploration parameters of 50, 100, and 200). ATS with specific teacher selection learns both the underlying utility function (a) and the expected rewards of each arm (b) much more accurately than ATS with general teacher selection and naive algorithms. The middle line is the median, boxes are the IQR, whiskers are $1.5$ times the IQR, and diamonds are outliers.
  • Figure 5: Mean action frequencies for various algorithms. $c$ actions are arm pulls and $\beta$ actions are teacher queries. Data is averaged across 25 runs of 20 HUB problems and smoothed over 10 steps.
  • ...and 5 more figures
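The trade-off described in the Figure 1 caption, giving up one timestep of utility to learn which arm is better, can be sketched numerically. The utilities $8$ and $2$ come from the caption. The deterministic arms, the discount factor, the horizon, the zero query cost, and the uniform-guessing baseline are assumptions chosen only for illustration.

```python
UTILITIES = {"apple": 8.0, "banana": 2.0}  # hidden utilities from Figure 1
ARMS = {                                   # assumption: each arm yields one fruit
    "arm_1": {"apple": 1.0},
    "arm_2": {"banana": 1.0},
}


def expected_arm_utility(arm_dist, utilities):
    """Expected utility of one pull: sum over items of P(item) * U(item)."""
    return sum(p * utilities[item] for item, p in arm_dist.items())


def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence, starting at timestep 0."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def rewards_with_query(horizon):
    """Pull each arm once, spend one step querying (no utility), then exploit."""
    values = [expected_arm_utility(d, UTILITIES) for d in ARMS.values()]
    return values + [0.0] + [max(values)] * (horizon - len(values) - 1)


def rewards_without_query(horizon):
    """Pull each arm once, then pick uniformly at random (expected utility)."""
    values = [expected_arm_utility(d, UTILITIES) for d in ARMS.values()]
    mean_value = sum(values) / len(values)
    return values + [mean_value] * (horizon - len(values))


if __name__ == "__main__":
    gamma, horizon = 0.9, 10  # assumed discount factor and episode length
    print("with query:   ", discounted_return(rewards_with_query(horizon), gamma))
    print("without query:", discounted_return(rewards_without_query(horizon), gamma))
```

With these assumed numbers, the single forgone timestep is repaid by pulling the better arm afterwards. With a shorter horizon, a smaller utility gap, or a costly teacher, the comparison can go the other way, which is why deciding when to query is part of the problem.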

Theorems & Definitions (16)

  • Definition 3.1
  • Theorem 3.2
  • Definition 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Theorem (the:convergence)
  • Proof (Sketch)
  • Theorem (the:upper)
  • Proof
  • ...and 6 more