Reward Compatibility: A Framework for Inverse RL

Filippo Lazzati, Mirco Mutti, Alberto Metelli

TL;DR

This work reframes inverse reinforcement learning through reward compatibility, replacing the brittle feasible set with a graded measure of how well a reward aligns with an expert’s demonstrations. By defining the (non)compatibility as the suboptimality gap $J^*(r;p)-J^{\pi^E}(r;p)$, the authors show how IRL can be cast as an IRL classification problem and solved with provably efficient algorithms. They introduce CATY-IRL for online settings and CATY-OFF-IRL for offline tabular IRL, providing concrete sample complexity bounds that scale favorably in large or continuous state spaces, including Linear MDPs. The framework is extended to suboptimal experts, multiple environments, and robust offline notions, highlighting practical pathways for reward learning and forward-RL applicability while outlining key limitations and future research directions.
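
As a rough illustration of the gap $J^*(r;p)-J^{\pi^E}(r;p)$, the sketch below computes it exactly for a tiny tabular MDP by backward induction. It is not the paper's algorithm: the stationary transition model, stationary reward and policy, fixed horizon `H`, and all function names are simplifying assumptions made for this example.

```python
import numpy as np

def optimal_value(p, r, H):
    """Optimal H-step value per start state (backward induction).
    p: (S, A, S) transition probabilities, r: (S, A) rewards.
    Stationarity of p and r is an assumption of this sketch."""
    v = np.zeros(p.shape[0])
    for _ in range(H):
        q = r + p @ v          # (S, A): reward plus expected next value
        v = q.max(axis=1)      # act greedily
    return v

def policy_value(p, r, pi, H):
    """H-step value of a stationary stochastic policy pi of shape (S, A)."""
    v = np.zeros(p.shape[0])
    for _ in range(H):
        q = r + p @ v
        v = (pi * q).sum(axis=1)  # average over the policy's actions
    return v

def noncompatibility(p, r, pi_expert, mu0, H):
    """Suboptimality gap J*(r;p) - J^{pi_E}(r;p) under initial distribution mu0."""
    gap = optimal_value(p, r, H) - policy_value(p, r, pi_expert, H)
    return float(mu0 @ gap)

# Toy MDP (hypothetical): 2 states, action 0 stays, action 1 switches state.
p = np.zeros((2, 2, 2))
for s in range(2):
    p[s, 0, s] = 1.0
    p[s, 1, 1 - s] = 1.0

# Expert moves to state 1 and then stays there.
pi_expert = np.array([[0.0, 1.0], [1.0, 0.0]])
mu0 = np.array([1.0, 0.0])  # always start in state 0

r_state1 = np.array([[0.0, 0.0], [1.0, 1.0]])  # pays for being in state 1
r_state0 = np.array([[1.0, 1.0], [0.0, 0.0]])  # pays for being in state 0

gap_good = noncompatibility(p, r_state1, pi_expert, mu0, H=3)
gap_bad = noncompatibility(p, r_state0, pi_expert, mu0, H=3)
print(gap_good, gap_bad)
```

In this toy case the reward that pays for state 1 has zero gap, since the expert is optimal for it, while the reward that pays for state 0 has a strictly positive gap. A graded measure of this kind is what replaces the binary in-or-out verdict of the feasible set.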

Abstract

We provide an original theoretical study of Inverse Reinforcement Learning (IRL) through the lens of reward compatibility, a novel framework to quantify the compatibility of a reward with the given expert's demonstrations. Intuitively, a reward is more compatible with the demonstrations the closer the performance of the expert's policy computed with that reward is to the optimal performance for that reward. This generalizes the notion of feasible reward set, the most common framework in the theoretical IRL literature, for which a reward is either compatible or not compatible. The grayscale introduced by the reward compatibility is the key to extend the realm of provably efficient IRL far beyond what is attainable with the feasible reward set: from tabular to large-scale MDPs. We analyze the IRL problem across various settings, including optimal and suboptimal expert's demonstrations and both online and offline data collection. For all of these dimensions, we provide a tractable algorithm and corresponding sample complexity analysis, as well as various insights on reward compatibility and how the framework can pave the way to yet more general problem settings.

Paper Structure

This paper contains 39 sections, 11 theorems, 88 equations, 4 figures, 1 table, 3 algorithms.

Key Result

Theorem 3.1

Let $\mathfrak{A}$ be a PAC algorithm for the IRL setting with optimal expert. Then, there exists a Linear MDP without reward $\mathcal{M}_\phi$ with a state space with finite but arbitrarily large cardinality $S$, and a deterministic expert's policy $\pi^E$, in which $\mathfrak{A}$ requires at least …

Figures (4)

  • Figure 1: (Left) In the framework of the feasible set, all the rewards in $\mathcal{R}_{\mathcal{M},\pi^E}^\complement$ (in pink), i.e., outside the feasible set $\mathcal{R}_{\mathcal{M},\pi^E}$ (in white), are considered equally wrong. (Right) In the framework of the reward compatibility, the rewards outside the feasible set $\mathcal{R}_{\mathcal{M},\pi^E}$ suffer from different errors (scale of pink).
  • Figure 2: The set $\mathcal{R}_\Delta$ of rewards positively classified by an IRL algorithm, with $\Delta>0$, represents an enlargement of the feasible set $\mathcal{R}^\star$, i.e., $\mathcal{R}^\star\subseteq\mathcal{R}_\Delta$. Visually, $\mathcal{R}_\Delta$ contains rewards outside $\mathcal{R}^\star$ whose shade of pink is not too intense.
  • Figure 3: The axis represents (non)compatibility values $\overline{\mathcal{C}}(\cdot)$ and we consider threshold $\eta=\Delta$. (a) Rewards $r$ with $\overline{\mathcal{C}}(r)\le\Delta-\epsilon$ or $\overline{\mathcal{C}}(r)\ge\Delta+\epsilon$ are correctly classified by an $(\epsilon,\delta)$-PAC algorithm with high probability, while (b) in the opposite case, $r$ can be misclassified. (c) The red interval $[\Delta-\epsilon,\Delta+\epsilon]$ exemplifies the set of rewards $\{r\in\mathcal{R}\,|\, |\overline{\mathcal{C}}(r)-\Delta|\le\epsilon\}$ that are (potentially) misclassified. The length of the interval reduces with $\epsilon$.
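
The thresholding picture in Figure 3 can be sketched in a few lines. This is an illustrative reading, not the paper's procedure: it assumes a classifier that accepts a reward when its estimated (non)compatibility is at most $\Delta$, and an estimate that lies within $\epsilon$ of the true value. The names `classify` and `outside_ambiguous_band` are hypothetical.

```python
def classify(c_hat, delta):
    """Accept a reward when its estimated (non)compatibility is at most delta.
    The thresholding rule itself is an assumption of this sketch."""
    return c_hat <= delta

def outside_ambiguous_band(c_true, delta, eps):
    """True when the true value lies strictly outside [delta - eps, delta + eps],
    so that any estimate within eps of it lands on the correct side of delta."""
    return abs(c_true - delta) > eps

delta, eps = 0.5, 0.1

# Far from the threshold: worst-case estimates still give the right label.
low_ok = classify(0.3 + eps, delta)        # true value 0.3, should be accepted
high_ok = not classify(0.7 - eps, delta)   # true value 0.7, should be rejected

# Inside the band: an estimate within eps can flip the label.
flipped = classify(0.55 - eps, delta)      # true value 0.55 exceeds delta, yet accepted

print(low_ok, high_ok, flipped)
```

Shrinking `eps` narrows the band in which such flips can occur, which matches the caption's remark that the length of the interval reduces with $\epsilon$.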

Theorems & Definitions (30)

  • Definition 3.1: Feasible Set for Optimal Expert
  • Definition 3.2: Feasible Set for Suboptimal Expert
  • Definition 3.3: Feasible Set for Optimal Expert in Linear MDPs
  • Definition 3.4: Feasible Set for Suboptimal Expert in Linear MDPs
  • Definition 3.5: PAC Algorithm
  • Theorem 3.1: Statistical Inefficiency for Optimal Expert
  • Theorem 3.2: Statistical Inefficiency for Suboptimal Expert
  • Example 4.1
  • Definition 4.1: Reward (non)Compatibility - Optimal Expert
  • Example 4.2: continuation of Example 4.1
  • ...and 20 more