Inverse Reinforcement Learning by Estimating Expertise of Demonstrators
Mark Beliaev, Ramtin Pedarsani
TL;DR
This work tackles imitation learning from suboptimal, heterogeneous demonstrations by introducing IRLEED, a framework that combines a Boltzmann-based suboptimality model for demonstrators with a Maximum Entropy IRL objective. By jointly estimating a ground-truth reward parameter $\theta$, per-demonstrator biases $\epsilon_i$, and per-demonstrator precisions $\beta_i$, IRLEED accounts for reward bias and action variance across sources, enabling dynamics-aware recovery of the optimal policy. The approach generalizes standard IRL and ILEED, provides gradient-based update rules, and supports practical neural-network-based extensions with regularization. Empirical results across online and offline tasks, using both simulated and human data, show IRLEED achieving higher returns and better reward recovery than baselines, particularly when learning from mixed-quality demonstrations. This makes it relevant to crowd-sourced and cross-source IL settings.
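To make the roles of the bias $\epsilon_i$ and the precision $\beta_i$ concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single-state (bandit) setting in which an action's value equals its reward, so the bias can be added directly to per-action values. The function names `boltzmann_policy` and `demonstrator_policy` and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def boltzmann_policy(q_values, beta):
    """Return action probabilities proportional to exp(beta * Q)."""
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def demonstrator_policy(true_q, bias, beta):
    """Hypothetical demonstrator: biased action values, then Boltzmann selection.

    Assumption: one-step setting, so the reward bias shifts each action's
    value directly. `bias` plays the role of epsilon_i, `beta` of beta_i.
    """
    biased_q = [q + b for q, b in zip(true_q, bias)]
    return boltzmann_policy(biased_q, beta)


# Illustrative values only: action 0 is optimal under the true reward.
true_q = [1.0, 0.5, 0.0]
no_bias = [0.0, 0.0, 0.0]

expert = demonstrator_policy(true_q, no_bias, beta=10.0)         # precise, unbiased
noisy = demonstrator_policy(true_q, no_bias, beta=0.5)           # high action variance
biased = demonstrator_policy(true_q, [0.0, 1.0, 0.0], beta=10.0) # precise but biased

for name, probs in [("expert", expert), ("noisy", noisy), ("biased", biased)]:
    print(name, [round(p, 3) for p in probs])
```

In this sketch, a low precision spreads probability across actions (action variance), while a nonzero bias makes even a precise demonstrator prefer the wrong action (reward bias). These are the two sources of suboptimality the summary says IRLEED estimates per demonstrator.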
Abstract
In Imitation Learning (IL), utilizing suboptimal and heterogeneous demonstrations presents a substantial challenge due to the varied nature of real-world data. However, standard IL algorithms treat these datasets as homogeneous, thereby inheriting the deficiencies of suboptimal demonstrators. Previous approaches to this issue rely on impractical assumptions such as high-quality data subsets, confidence rankings, or explicit environmental knowledge. This paper introduces IRLEED, Inverse Reinforcement Learning by Estimating Expertise of Demonstrators, a novel framework that overcomes these hurdles without prior knowledge of demonstrator expertise. IRLEED enhances existing Inverse Reinforcement Learning (IRL) algorithms by combining two components: a general model of demonstrator suboptimality that addresses reward bias and action variance, and a Maximum Entropy IRL framework that efficiently derives the optimal policy from diverse, suboptimal demonstrations. Experiments in both online and offline IL settings, with simulated and human-generated data, demonstrate IRLEED's adaptability and effectiveness, making it a versatile solution for learning from suboptimal demonstrations.
