Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods

Gergely Neu, Csaba Szepesvari

TL;DR

The paper proposes a novel gradient algorithm that learns a policy from an expert's observed behavior, assuming that the expert behaves optimally with respect to some unknown reward function of a Markovian Decision Problem.

Abstract

In this paper we propose a novel gradient algorithm to learn a policy from an expert's observed behavior assuming that the expert behaves optimally with respect to some unknown reward function of a Markovian Decision Problem. The algorithm's aim is to find a reward function such that the resulting optimal policy matches well the expert's observed behavior. The main difficulty is that the mapping from the parameters to policies is both nonsmooth and highly redundant. Resorting to subdifferentials solves the first difficulty, while the second one is overcome by computing natural gradients. We tested the proposed method in two artificial domains and found it to be more reliable and efficient than some previous methods.
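The redundancy mentioned in the abstract can be illustrated on a toy problem. In a discounted tabular MDP, scaling the reward by a positive constant or adding a constant to every reward leaves the greedy optimal policy unchanged, so many different reward parameters map to the same policy. The sketch below is only an illustration of that standard fact, not the paper's algorithm; the random MDP and the names `greedy_policy`, `P`, and `r` are assumptions made for this example.

```python
import numpy as np

def greedy_policy(P, r, gamma=0.9, iters=500):
    """Run value iteration on a tabular MDP and return the greedy action per state.

    P: (A, S, S) transition probabilities, r: (S, A) rewards (hypothetical toy inputs).
    """
    n_states, _ = r.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        # Q[s, a] = r[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = r + gamma * np.einsum("ast,t->sa", P, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Small random MDP, used purely for illustration.
rng = np.random.default_rng(0)
S, A = 5, 3
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)  # normalize each row into a distribution
r = rng.random((S, A))

pi = greedy_policy(P, r)
pi_scaled = greedy_policy(P, 2.0 * r)    # positive scaling of the reward
pi_shifted = greedy_policy(P, r + 5.0)   # constant shift of the reward

print(np.array_equal(pi, pi_scaled), np.array_equal(pi, pi_shifted))
```

The same toy setup also hints at the nonsmoothness: the greedy policy is piecewise constant in the reward, so small changes to `r` either leave `pi` untouched or flip an action outright.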

Paper Structure

This paper contains 10 sections, 15 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Performance as a function of the number of training samples. Each curve is an average of 10 runs using different samples, with $1 / 10$ s.e. error bars.
  • Figure 2: Performance with linearly transformed features. The features were transformed by a (nonsingular) square matrix with uniform $[0,1]$ random elements. Each curve is an average of 25 runs with different scalings of the features; the $1 / 10$ s.e. error bars are also plotted.
  • Figure 3: Performance as a function of the number of training episodes. The fraction of states where the found policy differs from the actual optimal policy is plotted against the number of episodes observed, measured by the mean of 5 runs. The $1 / 2$ s.e. error bars are also plotted for both methods.