How does Inverse RL Scale to Large State Spaces? A Provably Efficient Approach

Filippo Lazzati, Mirco Mutti, Alberto Maria Metelli

TL;DR

This work tackles the challenge of scaling Inverse Reinforcement Learning to large state spaces by reframing IRL from recovering a single reward to classifying rewards by their compatibility with expert demonstrations. It introduces Rewards Compatibility and IRL Classification, enabling a provably efficient CATY-IRL algorithm that achieves sample- and computation-efficient guarantees in Linear MDPs and matches minimax bounds in the tabular setting. The paper also establishes tighter lower bounds for Reward-Free Exploration, shows that IRL and RFE share the same worst-case rate, and proposes Objective-Free Exploration to unify exploration across RL and IRL tasks. Together, these contributions provide a principled, scalable framework for IRL and highlight directions toward general function approximation and offline scenarios.

Abstract

In online Inverse Reinforcement Learning (IRL), the learner can collect samples about the dynamics of the environment to improve its estimate of the reward function. Since IRL suffers from identifiability issues, many theoretical works on online IRL focus on estimating the entire set of rewards that explain the demonstrations, named the feasible reward set. However, none of the algorithms available in the literature can scale to problems with large state spaces. In this paper, we focus on the online IRL problem in Linear Markov Decision Processes (MDPs). We show that the structure offered by Linear MDPs is not sufficient for efficiently estimating the feasible set when the state space is large. As a consequence, we introduce the novel framework of rewards compatibility, which generalizes the notion of feasible set, and we develop CATY-IRL, a sample efficient algorithm whose complexity is independent of the cardinality of the state space in Linear MDPs. When restricted to the tabular setting, we demonstrate that CATY-IRL is minimax optimal up to logarithmic factors. As a by-product, we show that Reward-Free Exploration (RFE) enjoys the same worst-case rate, improving over the state-of-the-art lower bound. Finally, we devise a unifying framework for IRL and RFE that may be of independent interest.

Paper Structure

This paper contains 48 sections, 36 theorems, 118 equations, 5 figures, 1 table, 3 algorithms.

Key Result

Proposition 1

Let $\mathcal{M}$ be a Linear MDP without reward with a finite state space, and let $\phi$ be a feature mapping. Let $\{\Phi_h^{\pi^E}\}_{h\in \llbracket H\rrbracket}$ and $\{\overline{\Phi}_h\}_{h\in \llbracket H\rrbracket}$ be the sets of expert's and non-expert's features, defined for every $h \in \llbracket H\rrbracket$ [...] where $\mathcal{A}^E_h(s)\coloneqq\{a\in\mathcal{A}\,|\,\pi^E_h(a|s)>0\}$ for every $s \in \mathcal{S}$ [...]
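
The set $\mathcal{A}^E_h(s)$ above collects the actions that the expert plays with positive probability in state $s$ at stage $h$. As a minimal sketch of that definition only, the snippet below computes this support for a small tabular policy. The nested-list policy layout, the function name `expert_action_support`, and the toy numbers are assumptions made for illustration; stages and states are indexed from 0 here, whereas the proposition indexes stages over $\llbracket H\rrbracket$.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: policy[h][s][a] is the probability that the expert
# plays action a in state s at stage h (each innermost list sums to 1).
ExpertPolicy = List[List[List[float]]]


def expert_action_support(policy: ExpertPolicy) -> Dict[Tuple[int, int], Set[int]]:
    """Map each (stage, state) pair to the actions with positive probability."""
    support: Dict[Tuple[int, int], Set[int]] = {}
    for h, stage in enumerate(policy):
        for s, action_probs in enumerate(stage):
            support[(h, s)] = {a for a, p in enumerate(action_probs) if p > 0.0}
    return support


# Toy expert with 2 stages, 2 states, 2 actions (made-up numbers).
toy_policy: ExpertPolicy = [
    [[1.0, 0.0], [0.5, 0.5]],  # stage 0: states 0 and 1
    [[0.0, 1.0], [0.3, 0.7]],  # stage 1: states 0 and 1
]
support = expert_action_support(toy_policy)
```

For a deterministic expert every support set is a singleton, as in state 0 of the toy policy.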

Figures (5)

  • Figure 1: Flow-chart of CATY-IRL.
  • Figure 2: The axis represents (estimated) (non)compatibility values. (a) Rewards $r$ whose true (non)compatibility $\overline{\mathcal{C}}(r)\coloneqq\overline{\mathcal{C}}_{p,\pi^E}(r)$ is at least $\epsilon$ away from the threshold $\Delta$ are correctly classified, while (b) rewards closer to the threshold can be misclassified. (c) The red interval $[\Delta-\epsilon,\Delta+\epsilon]$ exemplifies the set of rewards $\{r\in\mathcal{R}\,|\, |\overline{\mathcal{C}}(r)-\Delta|\le\epsilon\}$ that are (potentially) misclassified. The interval shrinks as $\epsilon$ decreases.
  • Figure 3: The point at the center represents the initial state $s_0= d_0$ of the environment $\mathcal{M}$, and each ray starting from it represents the occupancy measure $d^{p,\pi}$ of some policy $\pi$. The intuition is that policies whose rays are close to each other induce similar visit distributions (e.g., both point in the same direction in some grid-world), while policies whose rays are far apart point in very different directions (i.e., they have different occupancy measures). The red area on the right denotes the set of directions (occupancy measures $d^{p,\pi}$ for some $\pi$) that are close in $\|\cdot\|_1$ norm to the direction of the expert $d^{p,\pi^E}$.
  • Figure 4: Hard instances.
  • Figure 5: Hard instances.
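
The thresholding idea in the Figure 2 caption can be illustrated with a short sketch. It is not the CATY-IRL algorithm itself, only the final comparison against $\Delta$. It assumes that a lower (non)compatibility value means a more compatible reward, that a reward is labelled compatible when its estimated value is at most $\Delta$, and that the estimate is within $\epsilon$ of the true value. The function names and the numbers are hypothetical.

```python
from typing import Literal

Label = Literal["compatible", "not compatible"]


def classify_reward(noncompatibility: float, threshold: float) -> Label:
    """Label a reward by comparing a (non)compatibility value with the threshold.

    Assumption: values at most `threshold` count as compatible.
    """
    return "compatible" if noncompatibility <= threshold else "not compatible"


def may_be_misclassified(true_value: float, threshold: float, epsilon: float) -> bool:
    """True when the true value lies in the interval [threshold - eps, threshold + eps]."""
    return abs(true_value - threshold) <= epsilon


threshold, epsilon = 0.5, 0.1  # made-up Delta and estimation accuracy

# Case (a): the true value is far from the threshold, so any estimate within
# epsilon of it receives the same label as the true value.
far_true, far_estimate = 0.2, 0.28
far_ok = classify_reward(far_estimate, threshold) == classify_reward(far_true, threshold)

# Case (b): the true value is inside the interval, so an estimate within
# epsilon can fall on the other side of the threshold.
near_true, near_estimate = 0.55, 0.46
near_ok = classify_reward(near_estimate, threshold) == classify_reward(near_true, threshold)
```

With these numbers, `far_ok` is `True` and `near_ok` is `False`. Tightening `epsilon` narrows the interval of rewards for which the second situation can occur.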

Theorems & Definitions (72)

  • Definition 3.1: Feasible Set [lazzati2024offline]
  • Proposition 1
  • Definition 3.2: PAC Algorithm
  • Theorem 3.1: Statistical Inefficiency
  • Theorem 3.2
  • Example 4.1
  • Definition 4.1: Rewards (non)Compatibility
  • Example 4.2: continuation of Example 4.1
  • Example 4.3
  • Definition 4.2: IRL Classification Problem and IRL Algorithm
  • ...and 62 more