A Bayesian Approach to Robust Inverse Reinforcement Learning
Ran Wei, Siliang Zeng, Chenliang Li, Alfredo Garcia, Anthony McDonald, Mingyi Hong
TL;DR
The paper addresses offline model-based IRL under uncertain deployment dynamics. It proposes a Bayesian framework that jointly infers the expert's reward and the expert's internal model of the dynamics, using a class of priors whose precision parameter $\lambda$ encodes how accurate that internal model is believed to be. From the resulting MAP objective the authors derive a scalable gradient-based algorithm (BM-IRL) and a robust variant (RM-IRL). They show that a prior favoring an accurate internal model induces policy robustness to model errors, which amounts to training against worst-case dynamics outside the offline data distribution. Theoretical performance guarantees quantify the trade-off between policy estimation error and dynamics estimation error. Experiments on Gridworld and MuJoCo benchmarks show state-of-the-art performance without hand-crafted pessimistic penalties. The approach gives a principled way to recover robust policies and offers insights for learning from suboptimal or biased demonstrations in real-world offline IRL tasks.
Abstract
We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL). The proposed framework differs from existing offline model-based IRL approaches by simultaneously estimating the expert's reward function and the expert's subjective model of the environment dynamics. Using a class of prior distributions that parameterizes how accurate the expert's model of the environment is, we develop efficient algorithms to estimate the expert's reward and subjective dynamics in high-dimensional settings. Our analysis reveals a novel insight: the estimated policy exhibits robust performance when the expert is believed (a priori) to have a highly accurate model of the environment. We verify this observation in the MuJoCo environments and show that our algorithms outperform state-of-the-art offline IRL algorithms.
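To make the joint estimation of reward and subjective dynamics concrete, the following is a minimal tabular sketch of what a MAP-style objective of this kind could look like. It is not the authors' implementation. It assumes a small discrete MDP, an expert modeled as soft-optimal (maximum-entropy) under its own dynamics model, and a prior whose log-density is `lam` times the log-likelihood of the observed transitions. The names `soft_policy`, `map_objective`, `T_hat`, and `lam` are hypothetical and chosen for illustration only.

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def soft_policy(reward, T_hat, gamma=0.9, iters=200):
    """Soft value iteration under the expert's subjective dynamics.

    reward: (S, A) array; T_hat: (S, A, S) array of transition probabilities.
    Returns log pi(a | s) as an (S, A) array.
    Assumption: the expert is modeled as a maximum-entropy agent.
    """
    V = np.zeros(reward.shape[0])
    for _ in range(iters):
        Q = reward + gamma * (T_hat @ V)  # (S, A, S) @ (S,) -> (S, A)
        V = logsumexp(Q, axis=1)
    Q = reward + gamma * (T_hat @ V)
    return Q - logsumexp(Q, axis=1)[:, None]

def map_objective(reward, T_hat, data, lam, gamma=0.9):
    """Action log-likelihood plus lam times transition log-likelihood.

    data: (N, 3) integer array of (state, action, next_state) tuples.
    Assumption: the log-prior on T_hat is lam * sum of log T_hat(s' | s, a)
    over the offline data, so a larger lam expresses a stronger belief that
    the expert's internal model is accurate.
    """
    log_pi = soft_policy(reward, T_hat, gamma)
    s, a, s_next = data.T
    action_ll = log_pi[s, a].sum()
    dynamics_ll = np.log(T_hat[s, a, s_next]).sum()
    return action_ll + lam * dynamics_ll

# Toy usage: two states, two actions, two observed transitions.
reward = np.zeros((2, 2))
T_hat = np.full((2, 2, 2), 0.5)
data = np.array([[0, 0, 1], [1, 1, 0]])
print(map_objective(reward, T_hat, data, lam=1.0))
```

In this sketch, `lam = 0` reduces the objective to fitting the demonstrated actions alone, while a large `lam` ties the subjective dynamics closely to the observed transitions. In a high-dimensional setting both `reward` and `T_hat` would be parameterized and optimized by gradient ascent rather than stored as tables.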
