Conservative Bias in Multi-Teacher Learning: Why Agents Prefer Low-Reward Advisors

Maher Mesto, Francisco Cruz

TL;DR

This work reveals a surprising conservative bias in multi-teacher reinforcement learning: agents preferentially consult low-reward, high-consistency teachers over high-reward but riskier ones. Using a grid-world setup with concept drift and two experimental regimes, the authors show robust phase transitions at thresholds of about 0.6 in both teacher availability $\rho$ and accuracy $\omega$, and they demonstrate substantial performance gains when guidance is consistent and reliable. The findings imply that risk-aware selection, captured by cumulative reward tracking, can emerge as a safety mechanism and may influence human-robot collaboration and training paradigms in safety-critical domains. The study also highlights the counterintuitive role of goal perception uncertainty as a regulariser, suggesting nuanced interactions between advice quality, consultation frequency, and adaptation speed under non-stationary conditions.

Abstract

Interactive reinforcement learning (IRL) has shown promise in enabling autonomous agents and robots to learn complex behaviours from human teachers, yet the dynamics of teacher selection remain poorly understood. This paper reveals an unexpected phenomenon in IRL: when given a choice between teachers with different reward structures, learning agents overwhelmingly prefer conservative, low-reward teachers (93.16% selection rate) over those offering 20× higher rewards. Through 1,250 experimental runs in navigation tasks with multiple expert teachers, we discovered: (1) Conservative bias dominates teacher selection: agents systematically choose the lowest-reward teacher, prioritising consistency over optimality; (2) Critical performance thresholds exist at teacher availability $\rho \geq 0.6$ and accuracy $\omega \geq 0.6$, below which the framework fails catastrophically; (3) The framework achieves a 159% improvement over baseline Q-learning under concept drift. These findings challenge fundamental assumptions about optimal teaching in RL and suggest potential implications for human-robot collaboration, where human preferences for safety and consistency may align with the observed agent selection behaviour, potentially informing training paradigms for safety-critical robotic applications.
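
The abstract names two teacher parameters, availability $\rho$ and accuracy $\omega$, without spelling out how they act on the advice an agent receives. The sketch below shows one common reading, offered as an assumption rather than as the authors' implementation: the teacher responds with probability $\rho$, and when it responds the advice is correct with probability $\omega$. The function names, the four-action set, and the rule of picking a wrong action uniformly at random are all hypothetical.

```python
import random


def teacher_advice(correct_action, actions, rho, omega, rng):
    """Return an advised action, or None when the teacher is unavailable.

    ASSUMPTION: rho is the probability that the teacher responds at all, and
    omega is the probability that a response is the correct action.
    """
    if rng.random() >= rho:
        return None  # teacher unavailable at this step
    if rng.random() < omega:
        return correct_action
    # Inaccurate advice: pick uniformly among the other actions (assumption).
    wrong_actions = [a for a in actions if a != correct_action]
    return rng.choice(wrong_actions)


def advice_stats(rho, omega, trials=10_000, seed=0):
    """Estimate (fraction of steps with advice, fraction of advice that is correct)."""
    rng = random.Random(seed)
    actions = ["up", "down", "left", "right"]  # hypothetical grid-world actions
    offered = 0
    correct = 0
    for _ in range(trials):
        advice = teacher_advice("up", actions, rho, omega, rng)
        if advice is not None:
            offered += 1
            if advice == "up":
                correct += 1
    availability = offered / trials
    accuracy = correct / offered if offered else 0.0
    return availability, accuracy


if __name__ == "__main__":
    for rho, omega in [(0.2, 0.2), (0.6, 0.6), (1.0, 1.0)]:
        print(rho, omega, advice_stats(rho, omega))
```

Under this reading, a teacher at $\rho = 0.6$, $\omega = 0.6$ gives correct advice on only about 36% of steps, which may help explain why performance is reported to collapse below that point.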

Paper Structure

This paper contains 37 sections, 7 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Teacher selection distribution across 1,250 runs showing strong conservative bias. Teacher 4 (Conservative) dominates with 93.16% selection despite offering the lowest rewards (goal=5 vs 10-100 for others), suggesting risk-averse behaviour that values consistency over high rewards. Data from 25 configurations, 50 runs each.
  • Figure 2: Impact of concept drift on pure Q-learning. Performance remains poor with drift events preventing learning progress. Without teachers, Q-learning achieves -15.83 average reward and 18.1% success across 50,000 episodes, demonstrating catastrophic failure under non-stationary conditions.
  • Figure 3: Performance heatmap revealing critical thresholds for teacher parameters. Black contour line marks the success boundary (zero reward). Clear phase transition occurs at $\rho \geq 0.6$ and $\omega \geq 0.6$, below which performance degrades to baseline levels. Optimal performance (9.23) achieved at $\rho=1.0$, $\omega=1.0$.
  • Figure 4: Comparative learning trajectories under drift. Best configuration ($\rho=1.0$, $\omega=1.0$, green) achieves a 159% improvement over baseline (dashed). Medium ($\rho=0.6$, $\omega=0.6$, orange) reaches the success threshold. Worst ($\rho=0.2$, $\omega=0.2$, red) matches baseline. Vertical lines mark drift events.
  • Figure 5: Impact of goal uncertainty on teacher effectiveness. Moderate uncertainty ($\sigma=1.0$) improves performance by 348% over perfect knowledge ($\sigma=0.0$). Performance degrades only at high uncertainty ($\sigma \geq 2.0$), suggesting uncertainty provides beneficial regularisation.
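
The figures quoted in the captions above can be cross-checked with a few lines of arithmetic. The sketch below is a sanity check under stated assumptions, not the authors' evaluation code: it assumes the improvement is measured relative to the magnitude of the baseline reward, and that the 25 configurations form a 5 × 5 grid over $\rho$ and $\omega$ in steps of 0.2 (the captions mention only the values 0.2, 0.6, and 1.0). All names are hypothetical.

```python
def relative_improvement(new, baseline):
    """Percentage change of `new` relative to the magnitude of `baseline`."""
    return (new - baseline) / abs(baseline) * 100.0


# Values quoted in the figure captions.
BASELINE_REWARD = -15.83      # Figure 2: Q-learning without teachers
BEST_REWARD = 9.23            # Figure 3: rho = 1.0, omega = 1.0
CONFIGURATIONS = 25           # Figure 1
RUNS_PER_CONFIGURATION = 50   # Figure 1

# ASSUMPTION: a 5 x 5 grid in steps of 0.2; the source does not list the levels.
LEVELS = [k / 10 for k in range(2, 11, 2)]  # 0.2, 0.4, 0.6, 0.8, 1.0
GRID = [(rho, omega) for rho in LEVELS for omega in LEVELS]

# Configurations at or above both reported thresholds.
ABOVE_THRESHOLD = [(rho, omega) for rho, omega in GRID if rho >= 0.6 and omega >= 0.6]

if __name__ == "__main__":
    print("total runs:", CONFIGURATIONS * RUNS_PER_CONFIGURATION)
    print("improvement (%):", round(relative_improvement(BEST_REWARD, BASELINE_REWARD), 1))
    print("grid size:", len(GRID), "above threshold:", len(ABOVE_THRESHOLD))
```

The run count reproduces the 1,250 quoted in Figure 1. The improvement computed this way comes out at about 158%, close to but not exactly the reported 159%, so the paper may use slightly different values or a different definition.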