Towards the Transferability of Rewards Recovered via Regularized Inverse Reinforcement Learning

Andreas Schlaginhaufen, Maryam Kamgarpour

TL;DR

This work shows that conditions developed under full access to the expert's policy cannot guarantee transferability in the more practical scenario where only demonstrations of the expert are available, and proposes principal angles as a more refined measure of similarity and dissimilarity between transition laws.

Abstract

Inverse reinforcement learning (IRL) aims to infer a reward from expert demonstrations, motivated by the idea that the reward, rather than the policy, is the most succinct and transferable description of a task [Ng et al., 2000]. However, the reward corresponding to an optimal policy is not unique, making it unclear if an IRL-learned reward is transferable to new transition laws in the sense that its optimal policy aligns with the optimal policy corresponding to the expert's true reward. Past work has addressed this problem only under the assumption of full access to the expert's policy, guaranteeing transferability when learning from two experts with the same reward but different transition laws that satisfy a specific rank condition [Rolland et al., 2022]. In this work, we show that the conditions developed under full access to the expert's policy cannot guarantee transferability in the more practical scenario where we have access only to demonstrations of the expert. Instead of a binary rank condition, we propose principal angles as a more refined measure of similarity and dissimilarity between transition laws. Based on this, we then establish two key results: 1) a sufficient condition for transferability to any transition laws when learning from at least two experts with sufficiently different transition laws, and 2) a sufficient condition for transferability to local changes in the transition law when learning from a single expert. Furthermore, we also provide a probably approximately correct (PAC) algorithm and an end-to-end analysis for learning transferable rewards from demonstrations of multiple experts.

Paper Structure (56 sections, 29 theorems, 114 equations, 4 figures, 1 table, 1 algorithm)

Key Result

Proposition 3.2

Under Assumptions \ref{ass:steep_regularization} and \ref{ass:occ_lower_bound}, we have $\mathsf{SubOpt}(r', \mu) = D_{\bar{h}}(\mu, \mathsf{RL}(r'))$ for any $\mu\in\mathcal{M}$.
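
For reference, the right-hand side can be expanded as follows, assuming $D_{\bar{h}}$ denotes the standard Bregman divergence generated by a differentiable, strictly convex $\bar{h}$. This summary does not restate the definition, so the expansion is an assumption rather than the paper's exact statement:

```latex
D_{\bar{h}}(\mu, \mu')
  = \bar{h}(\mu) - \bar{h}(\mu')
    - \langle \nabla \bar{h}(\mu'),\, \mu - \mu' \rangle,
\qquad \mu' = \mathsf{RL}(r').
```

Under this reading, the suboptimality of $\mu$ with respect to $r'$ is nonnegative and vanishes exactly when $\mu = \mathsf{RL}(r')$.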

Figures (4)

  • Figure 1: (a) illustrates the equivalence classes $[\hat{r}]_{\mathcal{U}}$ and $[r^{\textsf{E}}]_{\mathcal{U}}$, corresponding to the transition laws $P^0, P^1, P$ from Example \ref{ex1}, for a small $\beta$, in $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}/\bm 1$. The blue lines correspond to $P^0$, the red lines to $P^1$, and the gray lines to $P$. Furthermore, the shaded areas illustrate the approximation error around $[r^{\textsf{E}}]_{\mathcal{U}_{P^k}}$, as guaranteed by Lemma \ref{lem:bregman_bound}. (b) illustrates the uncertainty set for the recovered reward when learning from two experts, as discussed in the proof sketch of Theorem \ref{thm:global_transferability}.
  • Figure 2: (a) shows the second principal angle between the experts, for varying wind strength $\beta$. (b) shows the distance between $\hat{r}$ and $r^{\textsf{E}}$ in $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}/\bm 1$ for a varying number of expert demonstrations $N^{\textsf{E}}$ and wind strength $\beta$. (c) and (d) show the transferability to $P^{\text{South}}$ and $P^{\text{Shifted}}$ in terms of $\mathsf{SubOpt}_{P^{\text{South}}}(r^{\textsf{E}}, \mathsf{RL}_{P^{\text{South}}}(\hat{r}))$ and $\mathsf{SubOpt}_{P^{\text{Shifted}}}(r^{\textsf{E}}, \mathsf{RL}_{P^{\text{Shifted}}}(\hat{r}))$, respectively. The circles indicate the median and the shaded areas the 0.2 and 0.8 quantiles over 10 independent realizations of the expert data.
  • Figure 3: The sets of occupancy measures $\mathcal{M}_{P^0}$ and $\mathcal{M}_{P^1}$ are illustrated in $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}/\bm 1\cong \bm 1^\perp$. For a two-state-two-action MDP, the set of occupancy measures is given by the intersection of a two-dimensional affine subspace (a plane in $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}/\bm 1$) with the probability simplex in $\mathbb{R}^4$ (a tetrahedron in $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}/\bm 1$). We see that for a small $\beta$, the sets $\mathcal{M}_{P^0}$ and $\mathcal{M}_{P^1}$ are approximately parallel. That is, the angle between their normal vectors, which span the potential shaping spaces $\mathcal{U}_{P^0}$ and $\mathcal{U}_{P^1}$, is small. In contrast, for a large $\beta$ the orientations of $\mathcal{M}_{P^0}$ and $\mathcal{M}_{P^1}$ differ substantially, resulting in a large angle between the corresponding normal vectors.
  • Figure 4: (a) shows the second principal angle between $P_{\beta}^0$ and $P_{\beta}^1$ for varying wind strength $\beta$. (b) shows the distance between $\hat{r}$ and $r^{\textsf{E}}$ in $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}/\bm 1$ for a varying number of expert demonstrations $N^{\textsf{E}}$ and wind strength $\beta$. (c) and (d) show the transferability to $P^{\text{South}}$ and $P^{\text{Shifted}}$ in terms of $\mathsf{SubOpt}_{P^{\text{South}}}(r^{\textsf{E}}, \mathsf{RL}_{P^{\text{South}}}(\hat{r}))$ and $\mathsf{SubOpt}_{P^{\text{Shifted}}}(r^{\textsf{E}}, \mathsf{RL}_{P^{\text{Shifted}}}(\hat{r}))$, respectively. The dots indicate the median and the shaded areas the 0.2 and 0.8 quantiles over 10 independent realizations.
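
The principal angles referred to in the captions above can be computed numerically. The following is a minimal sketch, not the authors' implementation. It assumes that the potential shaping space $\mathcal{U}_P$ is the column space of $E - \gamma P$, where $P$ is the $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|$ transition matrix and $E$ copies $v(s)$ to every pair $(s, a)$. The function names, the discount factor, and the two example transition matrices are hypothetical.

```python
import numpy as np


def potential_shaping_basis(P, gamma):
    """Return E - gamma * P, whose columns are assumed to span the shaping space.

    P has shape (S * A, S) with rows ordered state-major: (s0, a0), (s0, a1), ...
    """
    n_sa, n_s = P.shape
    n_a = n_sa // n_s
    # E maps a state function v to the state-action function (s, a) -> v(s).
    E = np.repeat(np.eye(n_s), n_a, axis=0)
    return E - gamma * P


def principal_angles(A, B):
    """Principal angles in radians, ascending, between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of col(A), assuming full column rank
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


# Hypothetical two-state-two-action transition laws (each row sums to one).
P0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
P1 = np.full((4, 2), 0.5)

gamma = 0.9
angles = principal_angles(
    potential_shaping_basis(P0, gamma),
    potential_shaping_basis(P1, gamma),
)
print("principal angles (radians):", angles)
```

Under the stated assumption, both subspaces contain the constant vector $\bm 1$, because $(E - \gamma P)\bm 1 = (1 - \gamma)\bm 1$. The first principal angle is therefore numerically zero, which is consistent with the captions reporting the second principal angle.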

Theorems & Definitions (65)

  • Remark 2.1
  • Definition 3.0: $\varepsilon$-transferability
  • Remark 3.1
  • Example 3.2
  • Proposition 3.2
  • Lemma 3.2
  • Remark 3.3
  • Example 3.4 (continues Example \ref{ex1})
  • Definition 3.4: Principal angles \cite{galantai2013projectors}
  • Proposition 3.4
  • ...and 55 more