Bayesian Robust Optimization for Imitation Learning

Daniel S. Brown, Scott Niekum, Marek Petrik

TL;DR

BROIL leverages Bayesian reward function inference and a user-specific risk tolerance to efficiently optimize a robust policy that balances expected return and conditional value at risk. It outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms.

Abstract

One of the main challenges in imitation learning is determining what action an agent should take when outside the state distribution of the demonstrations. Inverse reinforcement learning (IRL) can enable generalization to new states by learning a parameterized reward function, but these approaches still face uncertainty over the true reward function and the corresponding optimal policy. Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework that optimizes a policy under the assumption of an adversarial reward function, whereas risk-neutral IRL approaches optimize a policy for either the mean or the MAP reward function. While completely ignoring risk can lead to overly aggressive and unsafe policies, optimizing in a fully adversarial sense is also problematic, as it can lead to overly conservative policies that perform poorly in practice. To provide a bridge between these two extremes, we propose Bayesian Robust Optimization for Imitation Learning (BROIL). BROIL leverages Bayesian reward function inference and a user-specific risk tolerance to efficiently optimize a robust policy that balances expected return and conditional value at risk. Our empirical results show that BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors and outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms. Code is available at https://github.com/dsbrown1331/broil.

Paper Structure

This paper contains 24 sections, 25 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: $\mathrm{VaR}_\alpha$ measures the $(1-\alpha)$-quantile worst-case outcome in a distribution. $\mathrm{CVaR}_\alpha$ measures the expectation given that we only consider values less than the $\mathrm{VaR}_\alpha$.
  • Figure 2: Machine Replacement MDP
  • Figure 3: Risk-sensitive ($\lambda \in [0,1)$) and risk-neutral ($\lambda=1$) policies for the machine replacement problem. Varying $\lambda$ results in a family of solutions that trade off conditional value at risk and return. The risk-neutral policy has heavy tails, while BROIL produces risk-sensitive policies that trade off a small decrease in expected return for a large increase in robustness (CVaR).
  • Figure 4: When demonstrations are ambiguous, BROIL results in a family of solutions that balance return and risk based on the value of $\lambda$. (a) Ambiguous demonstration that does not convey enough information to determine how undesirable the red states are. (b-c) MaxEnt IRL and LPAL result in stochastic policies, where the size of each arrow represents probability. (d) The robust policy with $\lambda = 0$ balances the goodness and badness of red and prefers taking a shortcut. (e-g) The regret policy avoids red for small $\lambda$. (h) The optimal policy for the mean reward ($\lambda=1$) takes a shortcut through red cells.
  • Figure 5: Sorted return distributions over the posterior for the BROIL Robust and Baseline Regret policies compared to the return distributions of the demonstration, MaxEnt IRL (Ziebart et al., 2008), and LPAL (Syed et al., 2008). The robust policy attempts to maximize worst-case performance over the posterior. The baseline regret policy also seeks to maximize worst-case performance, but relative to the demonstration.
  • ...and 1 more figure
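
The risk measures in Figure 1 and the $\lambda$ trade-off in Figure 3 can be illustrated with a small numerical sketch. The code below is not the authors' implementation (BROIL optimizes the policy itself; see the linked repository). It only evaluates a given set of return samples, assuming they are equally weighted draws from the reward posterior. The function names `var_cvar` and `blended_objective` and the two sample arrays are hypothetical and chosen for illustration.

```python
import numpy as np


def var_cvar(returns, alpha=0.95):
    """Empirical VaR_alpha and CVaR_alpha of return samples (higher is better).

    Assumes equally weighted samples, e.g. one return per posterior reward sample.
    """
    returns = np.asarray(returns, dtype=float)
    # VaR_alpha: the (1 - alpha)-quantile of the return distribution.
    var = np.quantile(returns, 1.0 - alpha)
    # CVaR_alpha: the mean of the outcomes at or below VaR (the worst tail).
    cvar = returns[returns <= var].mean()
    return var, cvar


def blended_objective(returns, lam, alpha=0.95):
    """lam * expected return + (1 - lam) * CVaR_alpha.

    lam = 1 is risk-neutral; lam = 0 considers only the tail.
    """
    returns = np.asarray(returns, dtype=float)
    _, cvar = var_cvar(returns, alpha)
    return lam * returns.mean() + (1.0 - lam) * cvar


# Hypothetical returns of two fixed policies under five posterior reward samples.
risky = [12.0, 12.0, 12.0, 12.0, -20.0]  # higher mean, heavy left tail
safe = [5.0, 5.0, 5.0, 5.0, 4.0]         # lower mean, mild left tail

for lam in (0.0, 0.5, 1.0):
    scores = {
        "risky": blended_objective(risky, lam, alpha=0.8),
        "safe": blended_objective(safe, lam, alpha=0.8),
    }
    print(f"lambda={lam}: preferred policy = {max(scores, key=scores.get)}")
```

In this toy example the risk-neutral setting ($\lambda=1$) prefers the policy with the higher mean, while $\lambda=0$ prefers the policy with the better worst-case tail, which mirrors the interpolation between return-maximizing and risk-minimizing behavior described in the abstract.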