Towards Generalized Inverse Reinforcement Learning

Chaosheng Dong, Yijia Wang

TL;DR

GIRL addresses learning all unknown MDP components from observed, potentially suboptimal behavior by jointly reconstructing a latent optimal policy and the MDP model using a policy-matrix formulation. The method minimizes the distance between the recovered policy and the observed policy, $\|\mathbf{\Pi}-\mathbf{\Pi}_0\|_F$, under MDP-consistency constraints, and scales to large state spaces via discretization and reward function approximation $R(s) \approx \sum_i \theta_i \phi_i(s)$. Empirical results on discrete and continuous grid worlds show that GIRL can recover unobserved actions/states, infer the reward and transition structures, and recover near-optimal policies despite observation noise. The work broadens IRL by enabling simultaneous inference of multiple, partially observable MDP components, with future directions including tackling unidentifiability through Bayesian or max-entropy techniques.
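
The two quantities named in the summary above can be illustrated with a minimal NumPy sketch. It assumes a deterministic policy is encoded as a one-hot $|\mathcal{S}| \times |\mathcal{A}|$ matrix (the paper's Definition 3.1 is not reproduced here, so this encoding is an assumption), and the helper names `policy_matrix`, `policy_distance`, and `linear_reward`, as well as the feature matrix, are hypothetical.

```python
import numpy as np

def policy_matrix(policy, n_actions):
    """One-hot |S| x |A| matrix for a deterministic policy (assumed encoding)."""
    policy = np.asarray(policy)
    mat = np.zeros((policy.size, n_actions))
    mat[np.arange(policy.size), policy] = 1.0
    return mat

def policy_distance(pi, pi0):
    """Frobenius norm ||Pi - Pi_0||_F between two policy matrices."""
    return np.linalg.norm(pi - pi0, ord="fro")

def linear_reward(theta, features):
    """R(s) ~= sum_i theta_i * phi_i(s); features has shape (|S|, k)."""
    return features @ theta

# Observed and recovered policies over 4 states and 3 actions;
# they disagree only in state 2.
observed = policy_matrix([0, 1, 1, 2], n_actions=3)
recovered = policy_matrix([0, 1, 2, 2], n_actions=3)
dist = policy_distance(recovered, observed)

# Hypothetical feature matrix: two indicator features over 4 states.
features = np.eye(4)[:, :2]
theta = np.array([1.0, -0.5])
rewards = linear_reward(theta, features)
```

Under this encoding, each state where the two policies disagree contributes 2 to the squared Frobenius distance, so `dist` here equals $\sqrt{2}$.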

Abstract

This paper studies generalized inverse reinforcement learning (GIRL) in Markov decision processes (MDPs), that is, the problem of learning the basic components of an MDP given observed behavior (policy) that might not be optimal. These components include not only the reward function and transition probability matrices, but also the action space and state space that are not exactly known but are known to belong to given uncertainty sets. We address two key challenges in GIRL: first, the need to quantify the discrepancy between the observed policy and the underlying optimal policy; second, the difficulty of mathematically characterizing the underlying optimal policy when the basic components of an MDP are unobservable or partially observable. Then, we propose the mathematical formulation for GIRL and develop a fast heuristic algorithm. Numerical results on both finite and infinite state problems show the merit of our formulation and algorithm.

Paper Structure (11 sections, 2 theorems, 7 equations, 3 figures, 4 tables, 1 algorithm)


Key Result

Proposition 2.1

Given a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, transition probability matrices $P$, and discount factor $\gamma \in (0,1)$, a deterministic policy $\pi$ is optimal if and only if the reward $\mathbf{R}$ satisfies
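
The condition itself is cut off in this extract and is not reconstructed here. For orientation only, the sketch below checks the classical characterization for state-only rewards from Ng and Russell (2000), $(\mathbf{P}_{\pi} - \mathbf{P}_a)(\mathbf{I} - \gamma \mathbf{P}_{\pi})^{-1}\mathbf{R} \succeq 0$ for every action $a$. Whether Proposition 2.1 takes exactly this form is an assumption, and the function name `is_optimal_ng_russell` and the two-state example are hypothetical.

```python
import numpy as np

def is_optimal_ng_russell(P, policy, R, gamma, tol=1e-9):
    """Check (P_pi - P_a)(I - gamma * P_pi)^{-1} R >= 0 for every action a.

    P:      array of shape (|A|, |S|, |S|), P[a, s, s2] = Pr(s2 | s, a)
    policy: array of shape (|S|,) with one action index per state
    R:      array of shape (|S|,) with state-only rewards
    """
    n_actions, n_states, _ = P.shape
    # Row s of P_pi is the transition row of the action chosen in state s.
    P_pi = P[policy, np.arange(n_states), :]
    # Value of the policy: V = (I - gamma * P_pi)^{-1} R.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    return all(np.all((P_pi - P[a]) @ V >= -tol) for a in range(n_actions))

# Two states; action 0 stays put, action 1 switches state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
])
R = np.array([0.0, 1.0])  # only state 1 is rewarding

# Moving to state 1 and staying there satisfies the condition.
optimal = is_optimal_ng_russell(P, np.array([1, 0]), R, gamma=0.9)
# Staying in state 0 and leaving state 1 violates it.
suboptimal = is_optimal_ng_russell(P, np.array([0, 1]), R, gamma=0.9)
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly; the small tolerance `tol` guards against rounding error in the inequality test.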

Figures (3)

  • Figure 1: IRL stands for inverse reinforcement learning. GIRL stands for the generalized inverse reinforcement learning studied in our paper. The observed policy $\pi_{0}$ might be non-optimal.
  • Figure 2: Learning the reward function for the discrete grid world. (a) The $5\times 5$ grid world with optimal policy. (b) Blocked grid world. (c) True reward. (d) Estimated reward.
  • Figure 3: Learning the reward function for the continuous grid world. The number of noisy states is set to 2. The top row shows the true rewards under the corresponding discretization; the bottom row shows the estimated rewards. (a) $10 \times 10$ discretization of the state space. (b) $20 \times 20$ discretization of the state space. (c) $30 \times 30$ discretization of the state space.

Theorems & Definitions (12)

  • Definition 2.1: Policy
  • Remark 2.1
  • Definition 2.2: Transition probability matrix
  • Proposition 2.1
  • Definition 3.1: Policy matrix
  • Remark 3.1
  • Definition 3.2: Distance between policies
  • Remark 3.2
  • Theorem 3.1
  • Remark 3.3
  • ...and 2 more