Inverse Reinforcement Learning with Multiple Planning Horizons

Jiayu Yao, Weiwei Pan, Finale Doshi-Velez, Barbara E Engelhardt

TL;DR

This work develops algorithms that learn a global multi-agent reward function, together with agent-specific discount factors, that reconstruct the expert policies. It characterizes the feasible solution space of the reward function and discount factors for both algorithms, and demonstrates the generalizability of the learned reward function across multiple domains.

Abstract

In this work, we study an inverse reinforcement learning (IRL) problem where the experts are planning under a shared reward function but with different, unknown planning horizons. Without the knowledge of discount factors, the reward function has a larger feasible solution set, which makes it harder for existing IRL approaches to identify a reward function. To overcome this challenge, we develop algorithms that can learn a global multi-agent reward function with agent-specific discount factors that reconstruct the expert policies. We characterize the feasible solution space of the reward function and discount factors for both algorithms and demonstrate the generalizability of the learned reward function across multiple domains.
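
To make the setting concrete, here is a minimal sketch of why experts who share one reward function can still behave differently. It is not the paper's algorithm: it runs standard value iteration on a hypothetical five-state deterministic MDP, and the states, action names, rewards, and the two discount factors are all invented for illustration.

```python
# Hypothetical deterministic MDP (not from the paper):
# TRANSITIONS[state][action] = (next_state, reward). State 4 is absorbing.
TRANSITIONS = {
    0: {"small": (4, 1.0), "toward_big": (1, 0.0)},
    1: {"toward_big": (2, 0.0)},
    2: {"toward_big": (3, 0.0)},
    3: {"toward_big": (4, 5.0)},
    4: {"stay": (4, 0.0)},
}


def q_value(transitions, values, gamma, state, action):
    """One-step lookahead: immediate reward plus discounted next-state value."""
    next_state, reward = transitions[state][action]
    return reward + gamma * values[next_state]


def value_iteration(transitions, gamma, tol=1e-10, max_iters=10_000):
    """Standard synchronous value iteration for a fixed discount factor."""
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        new_values = {
            s: max(q_value(transitions, values, gamma, s, a) for a in actions)
            for s, actions in transitions.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    return values


def greedy_policy(transitions, gamma):
    """Optimal action in every state under the shared reward and this gamma."""
    values = value_iteration(transitions, gamma)
    return {
        s: max(actions, key=lambda a: q_value(transitions, values, gamma, s, a))
        for s, actions in transitions.items()
    }


# Two hypothetical experts: same reward, different planning horizons.
myopic_expert = greedy_policy(TRANSITIONS, gamma=0.3)
farsighted_expert = greedy_policy(TRANSITIONS, gamma=0.9)
print(myopic_expert[0], farsighted_expert[0])  # prints "small toward_big"
```

In this toy example the delayed reward is worth $5\gamma^3$ from the start state, so the expert prefers it only when $5\gamma^3 > 1$, that is, for $\gamma$ above roughly 0.585. An IRL method that sees only the two policies must explain both with a single reward function, which is the difficulty the abstract describes when the discount factors are unknown.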

Paper Structure

This paper contains 44 sections, 9 theorems, 47 equations, 11 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

For a set of arbitrary distinct discount factors, $\Gamma$ ($\gamma_i\neq\gamma_j$ for $i\neq j$), let $\{z^*_k\}_{k=1}^K \ (z^*_k\in\mathbb{R}^{|\mathcal{S}|\times(|\mathcal{A}|-1)})$ be the optimal solution to the following LP problem, where $z_k(s,a)$ denotes the element of vector $z_k$ corresponding to the state-action tuple $(s,a)$. There exists a feasible reward solution $r$ that satisfies

Figures (11)

  • Figure 1: Plots of the value function of the initial state under (a) the true reward function $r^*$ and (b) the reward function learned by MPLP-IRL, $\tilde{r}$. The $x$- and $y$-axes represent the discount factor $\gamma\in[0,1]$ and the value function of the expert policies or reconstructed optimal policies, respectively. Each color represents a different policy. The dashed lines in (b) represent the learned discount factors, $\tilde{\Gamma}$. We see that MPLP-IRL recovers the order of the true discount factors.
  • Figure 2: Table of the generalization error (Eq. \ref{eqn:eval_gen}), with one standard deviation, of the learned reward function: each row represents a different algorithm and each column a different domain.
  • Figure 3: Trace plots of the best observed objective value of BO: the $x$- and $y$-axes represent the iteration and the best observed objective value, respectively. The red dashed line represents an approximate global maximum or the objective value under the ground truth.
  • Figure 4: Toy Domain
  • Figure 5: The reward function for each state of the big-small domain.
  • ...and 6 more figures
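
The curves described in the Figure 1 caption can be read through the standard policy-evaluation identity, sketched here for reference. The symbols $P^{\pi}$ (state-transition matrix under policy $\pi$), $r^{\pi}$ (expected one-step reward vector under $\pi$), and $e_{s_0}$ (indicator vector of the initial state $s_0$) are notation assumed for this sketch and may differ from the paper's. For a fixed policy $\pi$ and $\gamma\in[0,1)$:

$$
V^{\pi}_{\gamma}(s_0) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t)\,\middle|\, s_0\right] \;=\; e_{s_0}^{\top}\left(I-\gamma P^{\pi}\right)^{-1} r^{\pi}.
$$

Each policy therefore traces one curve as $\gamma$ varies, and a policy that is optimal for a given $\gamma$ attains the highest value among all curves at that $\gamma$.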

Theorems & Definitions (14)

  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Proposition 2
  • proof
  • Corollary 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • ...and 4 more