Model-Based Reward Shaping for Adversarial Inverse Reinforcement Learning in Stochastic Environments
Simon Sinong Zhan, Philip Wang, Qingyuan Wu, Yixuan Wang, Ruochen Jiao, Chao Huang, Qi Zhu
TL;DR
This work addresses the weakness of adversarial IRL methods in stochastic environments by introducing Model-Enhanced AIRL, which injects learned transition dynamics into reward shaping to preserve policy invariance and improve sample efficiency. The authors formulate a transition-aware reward design and an off-policy, model-based adversarial IRL framework, with theoretical guarantees on reward and policy performance under transition-model error. They derive bounds that link model error to the reward-learning error and to the difference in value, and report superior sample efficiency and robust performance on MuJoCo tasks with stochastic dynamics and on Atari with high-dimensional observations. The approach is a principled, model-based enhancement to IRL that is more robust to environment uncertainty while remaining competitive in deterministic settings.
Abstract
In this paper, we address a limitation of Adversarial Inverse Reinforcement Learning (AIRL) in stochastic environments, where its theoretical guarantees no longer hold and its performance degrades. We propose a method that infuses dynamics information into reward shaping, with a theoretical guarantee for the induced optimal policy in stochastic environments. Building on these model-enhanced rewards, we present the Model-Enhanced AIRL framework, which integrates transition model estimation directly into reward shaping. We further provide a theoretical analysis of the reward error bound and the performance difference bound for our method. Experimental results on MuJoCo benchmarks show that our method achieves superior performance in stochastic environments and competitive performance in deterministic environments, with significantly improved sample efficiency compared to existing baselines.
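As a rough illustration of what "infusing dynamics information into reward shaping" can look like, the sketch below evaluates a potential-based shaped reward in which the next-state potential is averaged over samples from a transition model rather than read off a single observed next state. This is an assumption about the general form, r(s, a) + gamma * E[phi(s')] - phi(s), and is not taken from the abstract. The function names, the Gaussian toy model, and the Monte Carlo estimate are all hypothetical and do not represent the authors' implementation.

```python
import random

def model_enhanced_reward(reward, potential, sample_next_states,
                          state, action, gamma=0.99, n_samples=64, rng=None):
    """Shaped reward r(s, a) + gamma * E_{s' ~ T_hat(.|s, a)}[phi(s')] - phi(s).

    The expectation over next states is a Monte Carlo estimate using a
    (hypothetical) learned transition model `sample_next_states`.
    """
    rng = rng if rng is not None else random.Random()
    next_states = sample_next_states(state, action, n_samples, rng)
    # Average the potential over sampled next states instead of one observed s'.
    expected_next_potential = sum(potential(s) for s in next_states) / len(next_states)
    return reward(state, action) + gamma * expected_next_potential - potential(state)

def make_gaussian_model(sigma):
    """Toy stand-in for a learned model: s' = s + a + Gaussian noise (assumed)."""
    def sample_next_states(state, action, n_samples, rng):
        return [[s + a + rng.gauss(0.0, sigma) for s, a in zip(state, action)]
                for _ in range(n_samples)]
    return sample_next_states

# Toy reward and potential functions, chosen only for illustration.
def toy_reward(state, action):
    return -sum(a * a for a in action)

def toy_potential(state):
    return sum(state)

if __name__ == "__main__":
    value = model_enhanced_reward(
        toy_reward, toy_potential, make_gaussian_model(0.1),
        state=[1.0, 2.0], action=[0.5, -0.5],
        gamma=0.9, n_samples=1000, rng=random.Random(0),
    )
    print(value)
```

Averaging over model samples makes the shaping term depend on the transition distribution rather than on one noisy realization of the next state, which is the intuition behind using a transition model in stochastic environments.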
