Hybrid-AIRL: Enhancing Inverse Reinforcement Learning with Supervised Expert Guidance
Bram Silue, Santiago Amaya-Corredor, Patrick Mannion, Lander Willem, Pieter Libin
TL;DR
The paper tackles learning from expert demonstrations in sparse-reward, imperfect-information settings by extending Adversarial Inverse Reinforcement Learning (AIRL) with supervised expert guidance and stochastic regularization, yielding Hybrid-AIRL (H-AIRL). By integrating a supervised loss into both the policy and the discriminator and injecting uncertainty via decaying Gaussian noise, H-AIRL achieves faster convergence, greater stability, and more faithful reward inference than AIRL. Evaluations on Gymnasium benchmarks and Heads-Up Limit Hold'em (HULHE) poker demonstrate improved sample efficiency and competitive RL performance when leveraging the learned reward. The work suggests that hybrid supervision can help scale IRL to complex, real-world domains with sparse rewards and partial observability, though it acknowledges limitations and points to future work on partial observability and disentangled rewards.
Abstract
Adversarial Inverse Reinforcement Learning (AIRL) has shown promise in addressing the sparse reward problem in reinforcement learning (RL) by inferring dense reward functions from expert demonstrations. However, its performance in highly complex, imperfect-information settings remains largely unexplored. To address this gap, we evaluate AIRL in the context of Heads-Up Limit Hold'em (HULHE) poker, a domain characterized by sparse, delayed rewards and significant uncertainty. In this setting, we find that AIRL struggles to infer a sufficiently informative reward function. To overcome this limitation, we contribute Hybrid-AIRL (H-AIRL), an extension that enhances reward inference and policy learning by incorporating a supervised loss derived from expert data and a stochastic regularization mechanism. We evaluate H-AIRL on a carefully selected set of Gymnasium benchmarks and the HULHE poker setting. Additionally, we analyze the learned reward function through visualization to gain deeper insights into the learning process. Our experimental results show that H-AIRL achieves higher sample efficiency and more stable learning than AIRL. This highlights the benefits of incorporating supervised signals into inverse RL and establishes H-AIRL as a promising framework for tackling challenging, real-world settings.
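
The two ingredients named above, a supervised loss on expert data and a decaying Gaussian noise term, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the exponential decay schedule, the cross-entropy form of the supervised term, the weighting coefficient, and the choice of what the noise is added to are all assumptions made for illustration. Only the policy-side term is shown; the discriminator-side supervised term is omitted.

```python
import numpy as np


def noise_std(step, sigma0=0.1, decay=0.999):
    """Noise scale after `step` updates (exponential decay is an assumption)."""
    return sigma0 * decay ** step


def add_decaying_noise(values, step, rng, sigma0=0.1, decay=0.999):
    """Add zero-mean Gaussian noise whose scale shrinks as training proceeds.

    Which quantity is perturbed (inputs, rewards, logits) is an assumption here.
    """
    sigma = noise_std(step, sigma0, decay)
    return values + rng.normal(0.0, sigma, size=np.shape(values))


def supervised_nll(logits, expert_actions):
    """Mean negative log-likelihood of expert actions under a softmax policy."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row-wise max for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(len(expert_actions))
    return -log_probs[rows, expert_actions].mean()


def hybrid_policy_loss(rl_loss, logits, expert_actions, weight=0.5):
    """Hypothetical combined objective: RL loss plus a weighted supervised term."""
    return rl_loss + weight * supervised_nll(logits, expert_actions)
```

In this sketch, `rl_loss` stands in for whatever policy objective is computed from the learned reward, and `weight` controls how strongly the policy is pulled toward the expert's actions.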
