Inverse Reinforcement Learning from Non-Stationary Learning Agents

Kavinayan P. Sivakumar, Yi Shen, Zachary Bell, Scott Nivison, Boyuan Chen, Michael M. Zavlanos

TL;DR

We provide a theoretical analysis establishing a complexity result with bound guarantees for the proposed method that improves on standard behavior cloning, together with numerical experiments on a reinforcement learning problem that validate the method.

Abstract

In this paper, we study an inverse reinforcement learning problem that involves learning the reward function of a learning agent using trajectory data collected while this agent is still learning its optimal policy. To address this problem, we propose an inverse reinforcement learning method that estimates the policy parameters of the learning agent, which can then be used to estimate its reward function. Our method relies on a new variant of the behavior cloning algorithm, which we call bundle behavior cloning. It uses a small number of trajectories generated by the learning agent's policy at different points in time to learn a set of policies that match the distribution of actions observed in the sampled trajectories. We then use the cloned policies to train a neural network model that estimates the reward function of the learning agent. We provide a theoretical analysis establishing a complexity result with bound guarantees for our method that improves on standard behavior cloning, as well as numerical experiments on a reinforcement learning problem that validate the proposed method.

Paper Structure

This paper contains 11 sections, 5 theorems, 14 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

With probability at least $1-\delta$, we have that […], where $T$ is the number of state-action pairs sampled from a single trajectory and $\Pi = \{\pi : \mathcal{S} \rightarrow \Delta(\mathcal{A})\}$ is the policy class, which is discrete with size $\vert \Pi \vert$.

Figures (6)

  • Figure 1: A visual illustration of Bundle Behavior Cloning. (a) Set of trajectories $\{\tau\}$ from a total of $E$ episodes during the forward step. (b) The first trajectory $\tau_0$, containing state-action pairs for $T$ timesteps. (c) The last trajectory $\tau_E$, containing state-action pairs for the last episode of learning. (d) The bundle $b_k$ of $M$ trajectories used to clone a policy $\tilde{\pi}^k_\psi$. (e) The trajectories in bundle $b_k$ are sampled to obtain a distribution of individual agent actions per state $s$, $\rho_{b_k}(s)$.
  • Figure 2: The full pipeline of the algorithm presented in this paper. Note that the learner does not need to finish optimizing its policy before its reward function can be estimated using the algorithm.
  • Figure 3: Colormaps representing the true and learned reward functions using Algorithm $2$ with Bundle Behavior Cloning. (a) Agent's normalized true reward function. (b) Agent's normalized learned reward function with the same neural network structure.
  • Figure 4: The state distribution in the forward case can affect the variance in reward estimation. (a) The state distribution of the forward trajectories, in percentages. (b) Lower bound and (c) upper bound of the predicted, scaled and normalized rewards at the $95\%$ confidence level. States with higher visitation percentages correlate with lower variance in the estimated rewards.
  • Figure 5: Only using the first 50 bundles ($750$ episodes) to learn the reward function.
  • ...and 1 more figure
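
The bundling step described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes discrete, hashable states and actions, groups consecutive episodes into bundles of $M$ trajectories, and computes only the empirical per-state action distribution $\rho_{b_k}(s)$ for each bundle. The paper's cloned policy $\tilde{\pi}^k_\psi$ is a parametrized model fitted to match this distribution, which the sketch does not attempt. The function names `make_bundles` and `empirical_action_distribution` are hypothetical, as is the choice to drop a trailing partial bundle.

```python
from collections import Counter, defaultdict


def make_bundles(trajectories, bundle_size):
    """Group consecutive episodes into bundles of `bundle_size` trajectories.

    Assumption of this sketch: a trailing partial bundle is dropped.
    """
    n_full = len(trajectories) // bundle_size
    return [
        trajectories[k * bundle_size:(k + 1) * bundle_size]
        for k in range(n_full)
    ]


def empirical_action_distribution(bundle):
    """Return {state: {action: frequency}} pooled over one bundle.

    Each trajectory is a list of (state, action) pairs.
    """
    counts = defaultdict(Counter)
    for trajectory in bundle:
        for state, action in trajectory:
            counts[state][action] += 1
    return {
        state: {a: c / sum(ctr.values()) for a, c in ctr.items()}
        for state, ctr in counts.items()
    }


# Toy data (hypothetical): 4 episodes, 2 states, actions "left"/"right".
trajectories = [
    [(0, "left"), (1, "right")],
    [(0, "left"), (1, "left")],
    [(0, "right"), (1, "right")],
    [(0, "right"), (1, "right")],
]

bundles = make_bundles(trajectories, bundle_size=2)
rho = [empirical_action_distribution(b) for b in bundles]

for k, dist in enumerate(rho):
    print(f"bundle {k}: {dist}")
```

In the toy data the action frequencies in state 0 shift from "left" in the first bundle to "right" in the second, which is the kind of drift a non-stationary learner produces and the reason for cloning one policy per bundle rather than one policy for all episodes. The Figure 5 caption (50 bundles covering 750 episodes) suggests a bundle size of 15 episodes in that experiment; `make_bundles(episodes, 15)` would reproduce that grouping under the assumptions above.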

Theorems & Definitions (8)

  • Lemma 1: Theorem 21 in Agarwal et al. (2020), FLAMBE
  • Lemma 2: Lemma 3 in Achiam et al. (2017), Constrained Policy Optimization
  • Lemma 3
  • proof
  • Corollary 1
  • proof
  • Theorem 1
  • proof