A Bayesian Approach to Robust Inverse Reinforcement Learning

Ran Wei, Siliang Zeng, Chenliang Li, Alfredo Garcia, Anthony McDonald, Mingyi Hong

TL;DR

The paper addresses offline model-based IRL under uncertain deployment dynamics. It proposes a Bayesian framework that simultaneously infers the expert's reward and the expert's internal dynamics model, using priors parameterized by a precision $\lambda$ that encodes how accurate the expert's model is believed to be. The authors formulate a MAP objective and derive a scalable gradient-based algorithm (BM-IRL) and robust variants (RM-IRL). They show that a prior favoring accurate internal models induces policy robustness to model errors, which amounts to training against worst-case dynamics outside the offline data distribution. Theoretical performance guarantees quantify the trade-off between policy estimation error and dynamics estimation error, and empirical results on Gridworld and MuJoCo benchmarks show state-of-the-art performance without hand-crafted pessimistic penalties. The approach offers a principled way to recover robust policies and gives insight into learning from suboptimal or biased demonstrations.

Abstract

We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL). The proposed framework differs from existing offline model-based IRL approaches by performing simultaneous estimation of the expert's reward function and subjective model of environment dynamics. We make use of a class of prior distributions which parameterizes how accurate the expert's model of the environment is to develop efficient algorithms to estimate the expert's reward and subjective dynamics in high-dimensional settings. Our analysis reveals a novel insight that the estimated policy exhibits robust performance when the expert is believed (a priori) to have a highly accurate model of the environment. We verify this observation in the MuJoCo environments and show that our algorithms outperform state-of-the-art offline IRL algorithms.
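
As a rough illustration of the simultaneous estimation idea described above, the following Python sketch evaluates an unnormalized log posterior over a reward table and a dynamics table in a small tabular MDP. It is not the authors' BM-IRL implementation. The soft (maximum-entropy) expert policy, the prior form ($\lambda$ times the log-likelihood of the observed transitions), and all names (`log_softmax`, `soft_policy`, `log_posterior`, `demos`) are assumptions made for this example.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def soft_policy(R, P, gamma=0.9, iters=200):
    """Soft value iteration (an assumed expert model, not taken from the source).

    R: (S, A) reward table. P: (S, A, S) transition probabilities.
    Returns log pi(a|s) with shape (S, A).
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)                      # (S, A)
        m = Q.max(axis=1)
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
    return log_softmax(R + gamma * (P @ V), axis=1)


def log_posterior(R, P_logits, demos, lam, gamma=0.9):
    """Unnormalized log posterior of (reward, dynamics) given demonstrations.

    demos: integer array of shape (N, 3) holding (s, a, s_next) triples.
    lam:   precision of the assumed prior; larger values push the estimated
           dynamics toward the transitions seen in the data.
    """
    log_P = log_softmax(P_logits, axis=-1)           # estimated dynamics
    log_pi = soft_policy(R, np.exp(log_P), gamma)    # policy planned in them
    s, a, s_next = demos.T
    action_ll = log_pi[s, a].sum()                   # likelihood of expert actions
    transition_ll = log_P[s, a, s_next].sum()        # assumed prior term
    return action_ll + lam * transition_ll


rng = np.random.default_rng(0)
R = rng.normal(size=(4, 2))
P_logits = rng.normal(size=(4, 2, 4))
demos = np.array([[0, 1, 2], [2, 0, 3], [3, 1, 0]])
for lam in (0.001, 0.5, 10.0):
    print(lam, log_posterior(R, P_logits, demos, lam))
```

In this sketch a gradient-based optimizer would ascend `log_posterior` jointly in `R` and `P_logits`. With small `lam` the dynamics are free to bend to explain the expert's actions, while large `lam` ties them to the observed transitions.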

Paper Structure

This paper contains 26 sections, 4 theorems, 33 equations, 3 figures, 2 tables, 2 algorithms.

Key Result

Theorem 3.1

Let $\epsilon_{\hat{\pi}} = -\mathbb{E}_{(s, a) \sim d_{P}^{\pi}}[\log \hat{\pi}_{\hat{P}}(a|s)]$ be the policy estimation error and $\epsilon_{\hat{P}} = \mathbb{E}_{(s, a) \sim d_{P}^{\pi}}D_{KL}[P(\cdot|s, a) || \hat{P}(\cdot|s, a)]$ be the dynamics estimation error. Let $R_{max} = \max_{s, a}|R_
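
The two error terms defined in Theorem 3.1 can be computed exactly in a small tabular MDP, which may help make the definitions concrete. The sketch below assumes that $d_{P}^{\pi}$ is the discounted state-action occupancy of the expert policy $\pi$ under the true dynamics $P$ with an initial state distribution `mu0`. The function names and the discount value are illustrative and not taken from the paper.

```python
import numpy as np


def occupancy(pi, P, mu0, gamma=0.9):
    """Discounted state-action occupancy d(s, a) of policy pi in dynamics P.

    pi: (S, A) action probabilities. P: (S, A, S) transitions. mu0: (S,).
    """
    S = pi.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)            # state-to-state kernel
    # Solve d_s = (1 - gamma) * mu0 + gamma * P_pi^T d_s for the state occupancy.
    d_s = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return d_s[:, None] * pi


def policy_error(d, log_pi_hat):
    """Expected negative log-likelihood of the estimated policy under d."""
    return -(d * log_pi_hat).sum()


def dynamics_error(d, P, P_hat, eps=1e-12):
    """Expected KL[P(.|s, a) || P_hat(.|s, a)] under d."""
    kl = (P * (np.log(P + eps) - np.log(P_hat + eps))).sum(axis=-1)
    return (d * kl).sum()


rng = np.random.default_rng(1)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))           # true dynamics
P_hat = rng.dirichlet(np.ones(S), size=(S, A))       # stand-in estimate
pi = rng.dirichlet(np.ones(A), size=S)               # stand-in expert policy
d = occupancy(pi, P, np.ones(S) / S)
print("policy error:", policy_error(d, np.log(pi)))
print("dynamics error:", dynamics_error(d, P, P_hat))
```

Passing the expert policy itself as the estimate, as in the usage lines, gives the expert's expected policy entropy under `d`, which is the smallest value the policy error can take for that occupancy.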

Figures (3)

  • Figure 1: Objectives of the traditional two-stage IRL and the proposed simultaneous estimation approach of Bayesian model-based IRL.
  • Figure 2: Gridworld experiment results. Ground truth and estimated target state distributions (softmax of reward; Row 1) and sample path of estimated policy in estimated dynamics (Row 2) for two-stage and BM-IRL agents with $\lambda \in \{0.001, 0.5, 10\}$. BM-IRL agents with higher $\lambda$ obtain more accurate reward estimates and commit fewer illegal transitions.
  • Figure 3: MuJoCo benchmark performance using 10 expert trajectories from the D4RL dataset. Bar heights and error bars represent the means and standard deviations of normalized scores, respectively, over 5 random seeds. Baseline algorithm performances are taken from Zeng et al. (2023).

Theorems & Definitions (8)

  • Theorem 3.1
  • Remark B.1
  • proof
  • Lemma B.2
  • proof
  • Lemma B.3
  • Theorem B.4
  • proof