Conservative Bias in Multi-Teacher Learning: Why Agents Prefer Low-Reward Advisors
Maher Mesto, Francisco Cruz
TL;DR
This work reveals a surprising conservative bias in multi-teacher reinforcement learning: agents preferentially consult low-reward, high-consistency teachers over high-reward but riskier ones. Using a grid-world setup with concept drift and two experimental regimes, the authors show robust phase transitions at thresholds of around 0.6 for both teacher availability $\rho$ and accuracy $\omega$, and they demonstrate substantial performance gains when guidance is consistent and reliable. The findings imply that risk-aware selection, captured by cumulative reward tracking, can emerge as a safety mechanism, with consequences for human-robot collaboration and training paradigms in safety-critical domains. The study also highlights the counterintuitive role of goal perception uncertainty as a regulariser, suggesting nuanced interactions between advice quality, consultation frequency, and adaptation speed under non-stationary conditions.
Abstract
Interactive reinforcement learning (IRL) has shown promise in enabling autonomous agents and robots to learn complex behaviours from human teachers, yet the dynamics of teacher selection remain poorly understood. This paper reveals an unexpected phenomenon in IRL: when given a choice between teachers with different reward structures, learning agents overwhelmingly prefer conservative, low-reward teachers (93.16% selection rate) over those offering rewards 20 times higher. Through 1,250 experimental runs in navigation tasks with multiple expert teachers, we find that: (1) conservative bias dominates teacher selection: agents systematically choose the lowest-reward teacher, prioritising consistency over optimality; (2) critical performance thresholds exist at teacher availability $\rho \geq 0.6$ and accuracy $\omega \geq 0.6$, below which the framework fails catastrophically; (3) the framework achieves a 159% improvement over baseline Q-learning under concept drift. These findings challenge fundamental assumptions about optimal teaching in RL. They also have possible implications for human-robot collaboration: human preferences for safety and consistency may align with the observed agent selection behaviour, which could inform training paradigms for safety-critical robotic applications.
