Hybrid-AIRL: Enhancing Inverse Reinforcement Learning with Supervised Expert Guidance

Bram Silue, Santiago Amaya-Corredor, Patrick Mannion, Lander Willem, Pieter Libin

TL;DR

The paper tackles learning from expert demonstrations in sparse-reward, imperfect-information settings by extending Adversarial Inverse Reinforcement Learning (AIRL) with supervised expert guidance and stochastic regularization, yielding Hybrid-AIRL (H-AIRL). By integrating a supervised loss into both the policy and the discriminator and injecting uncertainty via decaying Gaussian noise, H-AIRL achieves faster convergence, greater stability, and more faithful reward inference than AIRL. Evaluations on Gymnasium benchmarks and Heads-Up Limit Hold'em (HULHE) poker demonstrate improved sample efficiency and competitive RL performance when leveraging the learned reward. The work suggests that hybrid supervision can robustly scale IRL to real-world, complex domains with sparse rewards and partial observability, though it acknowledges limitations and points to future work on partial observability and disentangled rewards.

Abstract

Adversarial Inverse Reinforcement Learning (AIRL) has shown promise in addressing the sparse reward problem in reinforcement learning (RL) by inferring dense reward functions from expert demonstrations. However, its performance in highly complex, imperfect-information settings remains largely unexplored. To explore this gap, we evaluate AIRL in the context of Heads-Up Limit Hold'em (HULHE) poker, a domain characterized by sparse, delayed rewards and significant uncertainty. In this setting, we find that AIRL struggles to infer a sufficiently informative reward function. To overcome this limitation, we contribute Hybrid-AIRL (H-AIRL), an extension that enhances reward inference and policy learning by incorporating a supervised loss derived from expert data and a stochastic regularization mechanism. We evaluate H-AIRL on a carefully selected set of Gymnasium benchmarks and the HULHE poker setting. Additionally, we analyze the learned reward function through visualization to gain deeper insights into the learning process. Our experimental results show that H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL. This highlights the benefits of incorporating supervised signals into inverse RL and establishes H-AIRL as a promising framework for tackling challenging, real-world settings.

Paper Structure

This paper contains 21 sections, 21 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Reward learning curves for AIRL (green) and H-AIRL (red) on Gymnasium benchmarks, alongside an expert PPO baseline (blue).
  • Figure 2: The policy's state-level action alignment with the expert, for AIRL (green) and H-AIRL (red), across benchmarks with discrete action spaces.
  • Figure 3: RL training curves of PPO or DQN agents using environment (blue), AIRL-derived (green), and H-AIRL-derived (red) rewards on Gymnasium benchmarks and Heads-Up Limit Hold'em poker.
  • Figure 4: Preferred actions according to the learned reward functions over the MountainCar state space (position vs. velocity), for each discrete action: "thrust right" (R, blue), "no thrust" (N, green), or "thrust left" (L, red).
  • Figure 5: One-factor-at-a-time (OFAT) sweeps on MountainCar for H-AIRL's core hyperparameters: (a) the policy supervision weight $\alpha$, (b) the discriminator supervision weight $\beta$, (c) the initial noise standard deviation $\sigma_{\text{start}}$, and (d) the final noise standard deviation $\sigma_{\text{end}}$. Each curve shows the mean performance and standard deviation over 10 independent runs.
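
The four hyperparameters in Figure 5 suggest how the two ingredients named in the TL;DR might fit together: a supervision weight scales an expert-derived loss term, and the noise standard deviation moves from $\sigma_{\text{start}}$ to $\sigma_{\text{end}}$ over training. The Python sketch below illustrates that idea only. The linear decay schedule, the additive form of the combined loss, the cross-entropy supervised term, and all function names are assumptions made for illustration; this summary does not specify the exact loss, the decay shape, or where the noise is injected.

```python
import math
import random


def noise_std(step, total_steps, sigma_start, sigma_end):
    """Noise std at a given step (ASSUMED linear decay from start to end)."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return sigma_start + frac * (sigma_end - sigma_start)


def add_gaussian_noise(values, sigma, rng):
    """Perturb each value with zero-mean Gaussian noise of std sigma.

    Which quantity receives the noise is not stated in this summary,
    so the function takes a generic list of floats (an assumption).
    """
    return [v + rng.gauss(0.0, sigma) for v in values]


def cross_entropy(action_probs, expert_action):
    """Negative log-likelihood of the expert's action (ASSUMED supervised term)."""
    return -math.log(action_probs[expert_action])


def hybrid_loss(adversarial_loss, supervised_loss, weight):
    """ASSUMED additive combination: adversarial term + weight * supervised term.

    `weight` plays the role of alpha (policy) or beta (discriminator).
    """
    return adversarial_loss + weight * supervised_loss


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical values, chosen only to exercise the functions.
    for step in (0, 500, 1000):
        sigma = noise_std(step, 1000, sigma_start=0.2, sigma_end=0.0)
        noisy = add_gaussian_noise([1.0, -1.0], sigma, rng)
        print(step, round(sigma, 3), noisy)
    sup = cross_entropy([0.7, 0.2, 0.1], expert_action=0)
    print(hybrid_loss(adversarial_loss=0.9, supervised_loss=sup, weight=0.5))
```

Under these assumptions, setting the weight to zero recovers a purely adversarial objective, and setting $\sigma_{\text{start}} = \sigma_{\text{end}} = 0$ disables the noise, which is one way to read the OFAT sweeps in Figure 5.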