Inverse Reinforcement Learning with Multiple Planning Horizons

Jiayu Yao; Weiwei Pan; Finale Doshi-Velez; Barbara E Engelhardt

TL;DR

This work develops algorithms that learn a global multi-agent reward function together with agent-specific discount factors that reconstruct the expert policies. It characterizes the feasible solution space of the reward function and discount factors for both algorithms, and demonstrates the generalizability of the learned reward function across multiple domains.

Abstract

In this work, we study an inverse reinforcement learning (IRL) problem where the experts are planning under a shared reward function but with different, unknown planning horizons. Without the knowledge of discount factors, the reward function has a larger feasible solution set, which makes it harder for existing IRL approaches to identify a reward function. To overcome this challenge, we develop algorithms that can learn a global multi-agent reward function with agent-specific discount factors that reconstruct the expert policies. We characterize the feasible solution space of the reward function and discount factors for both algorithms and demonstrate the generalizability of the learned reward function across multiple domains.
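To make the setting concrete, the following is a minimal sketch, not the paper's algorithm, of how two experts who share one reward function but plan with different discount factors can end up with different optimal policies. The toy MDP and the names `TRANSITIONS`, `value_iteration`, and `greedy_policy` are assumptions introduced here for illustration only.

```python
# Hypothetical deterministic toy MDP (not taken from the paper).
# TRANSITIONS[state][action] = (next_state, reward)
TRANSITIONS = {
    "start": {"small": ("end", 1.0), "big": ("mid", 0.0)},
    "mid": {"go": ("goal", 0.0)},
    "goal": {"go": ("end", 4.0)},
    "end": {"stay": ("end", 0.0)},
}


def value_iteration(transitions, gamma, n_iters=200):
    """Run synchronous value iteration and return the state values."""
    values = {s: 0.0 for s in transitions}
    for _ in range(n_iters):
        values = {
            s: max(r + gamma * values[s2] for (s2, r) in acts.values())
            for s, acts in transitions.items()
        }
    return values


def greedy_policy(transitions, gamma):
    """Return the policy that is greedy with respect to the converged values."""
    values = value_iteration(transitions, gamma)
    return {
        s: max(acts, key=lambda a: acts[a][1] + gamma * values[acts[a][0]])
        for s, acts in transitions.items()
    }


# Two experts share the reward above but use different discount factors.
myopic = greedy_policy(TRANSITIONS, gamma=0.3)
farsighted = greedy_policy(TRANSITIONS, gamma=0.9)
print(myopic["start"], farsighted["start"])
```

In this toy example the immediate reward of 1 competes with a reward of 4 received two steps later, so the delayed option is preferred only when $4\gamma^2 > 1$, that is, when $\gamma > 0.5$. An IRL method that assumes a single known discount factor would have to explain both behaviors with one horizon, which is the difficulty the abstract describes.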

Paper Structure (44 sections, 9 theorems, 47 equations, 11 figures, 1 table, 2 algorithms)

Introduction
Related Work
Problem Setting
Markov decision processes (MDPs).
Multi-planning horizon IRL (MP-IRL).
Algorithms for MP-IRL: LP-IRL
Naive Extension of LP-IRL Fails
Multi-planning horizon LP-IRL (MPLP-IRL)
Inference for MPLP-IRL
Algorithms for multi-planning horizon IRL: MCE-IRL
Strong duality does not hold for multi-planning horizon MCE-IRL (MPMCE-IRL)
Inference for multi-planning horizon MCE-IRL
Feasibility and Identifiability Analysis for Inference
Experiments and Results
Domains
...and 29 more sections

Key Result

Theorem 1

For a set of arbitrary distinct discount factors, $\Gamma$ ($\gamma_i\neq\gamma_j$ for $i\neq j$), let $\{z^*_k\}_{k=1}^K \ (z^*_k\in\mathbb{R}^{|\mathcal{S}|\times(|\mathcal{A}|-1)})$ be the optimal solution to the following LP problem, where $z_k(s,a)$ denotes the element of vector $z_k$ corresponding to the state-action tuple $(s,a)$. There exists a feasible reward solution $r$ that satisfies

Figures (11)

Figure 1: Plots of the value function of the initial state under (a) the true reward function $r^*$, (b) the learned reward function of MPLP-IRL, $\tilde{r}$: $x,\ y$-axes represent the discount factor $\gamma\in[0,1]$ and the value function of expert policies or reconstructed optimal policies, respectively. Each color represents a different policy. The dashed lines in (b) represent the learned discount factors, $\tilde{\Gamma}$. We see that MPLP-IRL recovers the order of true discount factors.
Figure 2: The table of the generalization error (Eq. \ref{eqn:eval_gen}) with one standard deviation of the learned reward function: each row and column represents a different algorithm and domain, respectively.
Figure 3: Trace plots of the best observed objective value of BO: $x,\ y$-axes represent the iteration and the best observed objective value, respectively. The red dashed line represents an approximate global maximum or the objective value under the ground truth.
Figure 4: Toy Domain
Figure 5: The reward function of the big-small domain, shown for each state.
...and 6 more figures

Theorems & Definitions (14)

Theorem 1
Theorem 2
Proposition 1
Proposition 2
proof
Corollary 1
Theorem 1
proof
Theorem 2
proof
...and 4 more
