Model Predictive Adversarial Imitation Learning for Planning from Observation

Tyler Han, Yanda Bao, Bhaumik Mehta, Gabriel Guo, Anubhav Vishwakarma, Emily Kang, Sanghun Jung, Rosario Scalise, Jason Zhou, Bryan Xu, Byron Boots

TL;DR

This study derives a replacement for the policy in IRL with a planning-based agent, enabling end-to-end interactive learning of planners from observation-only demonstrations. The authors observe significant improvements in sample efficiency, out-of-distribution generalization, and robustness.

Abstract

Human demonstration data is often ambiguous and incomplete, motivating imitation learning approaches that also exhibit reliable planning behavior. A common paradigm to perform planning-from-demonstration involves learning a reward function via Inverse Reinforcement Learning (IRL) then deploying this reward via Model Predictive Control (MPC). Towards unifying these methods, we derive a replacement of the policy in IRL with a planning-based agent. With connections to Adversarial Imitation Learning, this formulation enables end-to-end interactive learning of planners from observation-only demonstrations. In addition to benefits in interpretability, complexity, and safety, we study and observe significant improvements on sample efficiency, out-of-distribution generalization, and robustness. The study includes evaluations in both simulated control benchmarks and real-world navigation experiments using few-to-single observation-only demonstrations.

Paper Structure

This paper contains 25 sections, 28 equations, 12 figures, 5 tables, 3 algorithms.

Figures (12)

  • Figure 1: Model Predictive Adversarial Imitation Learning (MPAIL) learns costs for a planning-based, Model Predictive Control (MPC) agent from observation-only demonstration. Interactions with these costs are simultaneously used to learn a value function for experience-based reasoning beyond the horizon of the planner.
  • Figure 2: Illustration of $\pi_\text{MPPI}$ in MPAIL. (1) A set of action sequences (plans) are sampled and rolled out. (2) Plans are costed according to the discriminator, shifting the distribution towards the expert. Temperature $\lambda$ optionally decays over episodes, narrowing the distribution. (3) The policy $\pi_{\text{MPPI}}$ is the result of a Gaussian fit to the optimized plans and their respective first actions.
  • Figure 3: Four Expert Trajectories in Navigation Task. Cars initialized around $(0,0)$.
  • Figure 4: Comparison of policy-based and planning-based AIL in Out-of-Distribution (OOD) states. Agents trained on the navigation task (\ref{sec:nav-task}) are placed uniformly, with random orientation, within a 40 $\times$ 40 m box centered on $(0,0)$. The policy and planner are run for 100 timesteps in the environment. Data support of the expert exists mainly between $(0,0)$ and $(10, 10)$. Quantitative evaluation of this experiment can be found in \ref{fig:ood-quant}. A comparison that includes a learned dynamics model can be found in \ref{fig:ood-om}.
  • Figure 5: OOD Navigation Evaluation. Agent initial poses vary from In-distribution (ID) to OOD relative to the expert data and are plotted with their final reward after 100 timesteps. Metric from liu_energy-based_2020 (see \ref{app:ood}).
  • ...and 7 more figures
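
The three steps in the Figure 2 caption (sample and roll out plans, cost them and reweight with temperature $\lambda$, fit a Gaussian) can be illustrated with a minimal sketch of a generic MPPI update. This is not the authors' implementation: `mppi_step`, `toy_dynamics`, and `toy_cost` are hypothetical names, the dynamics are a toy 1-D integrator, and the cost is a hand-written stand-in for the learned discriminator cost. It is written over state transitions only to mirror the observation-only setting.

```python
import math
import random


def toy_dynamics(state, action):
    # Hypothetical 1-D integrator; a real planner would use a known or learned model.
    return state + action


def toy_cost(state, next_state, goal=1.0):
    # Hypothetical stand-in for a learned discriminator cost on (s, s') pairs.
    return (next_state - goal) ** 2


def mppi_step(state, mean_plan, dynamics, cost,
              num_samples=64, sigma=0.5, lam=1.0, rng=None):
    """One generic MPPI update over a plan of scalar actions."""
    rng = rng or random.Random(0)
    plans, totals = [], []
    # (1) Sample action sequences around the current mean plan and roll them out.
    for _ in range(num_samples):
        plan = [m + rng.gauss(0.0, sigma) for m in mean_plan]
        s, total = state, 0.0
        for a in plan:
            s_next = dynamics(s, a)
            total += cost(s, s_next)
            s = s_next
        plans.append(plan)
        totals.append(total)
    # (2) Weight plans by exp(-cost / lam); subtracting the minimum cost
    #     keeps the exponentials numerically stable. Smaller lam narrows
    #     the distribution toward the lowest-cost plans.
    best = min(totals)
    weights = [math.exp(-(c - best) / lam) for c in totals]
    z = sum(weights)
    weights = [w / z for w in weights]
    # (3) Gaussian fit: weighted mean of every plan step, plus the weighted
    #     standard deviation of the first action.
    new_mean = [sum(w * p[t] for w, p in zip(weights, plans))
                for t in range(len(mean_plan))]
    first_var = sum(w * (p[0] - new_mean[0]) ** 2
                    for w, p in zip(weights, plans))
    return new_mean, math.sqrt(first_var), weights


# Usage: one update from state 0.0 with a zero-initialized 5-step plan.
new_plan, first_action_std, plan_weights = mppi_step(
    0.0, [0.0] * 5, toy_dynamics, toy_cost, lam=0.5)
```

The returned mean and first-action standard deviation play the role of the Gaussian policy over the first action described in the caption. Decaying `lam` across episodes would correspond to the optional temperature decay.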

Theorems & Definitions (5)

  • proof
  • proof
  • proof
  • proof
  • proof