
Imitation Learning from Observation through Optimal Transport

Wei-Di Chang, Scott Fujimoto, David Meger, Gregory Dudek

TL;DR

This paper re-examines optimal transport for imitation learning (IL), in which a reward is generated from the Wasserstein distance between the state trajectories of the learner and the expert. It shows that existing methods can be simplified to produce a reward function without learned models or adversarial learning.

Abstract

Imitation Learning from Observation (ILfO) is a setting in which a learner tries to imitate the behavior of an expert, using only observational data and without the direct guidance of demonstrated actions. In this paper, we re-examine optimal transport for IL, in which a reward is generated based on the Wasserstein distance between the state trajectories of the learner and expert. We show that existing methods can be simplified to generate a reward function without requiring learned models or adversarial learning. Unlike many other state-of-the-art methods, our approach can be integrated with any RL algorithm and is amenable to ILfO. We demonstrate the effectiveness of this simple approach on a variety of continuous control tasks and find that it surpasses the state of the art in the ILfO setting, achieving expert-level performance across a range of evaluation domains even when observing only a single expert trajectory without actions.
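To make the idea of a transport-based reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes uniform weights over time steps, a Euclidean cost between states, and an entropy-regularized (Sinkhorn) solver for the coupling; the names `sinkhorn_coupling`, `ot_rewards`, and the parameter `reg` are hypothetical and chosen only for illustration. The per-step reward is taken to be the negative transport cost assigned to each learner state.

```python
import numpy as np

def sinkhorn_coupling(cost, reg=0.05, n_iters=200):
    """Approximate OT coupling between two uniform empirical measures.

    Assumption: entropy-regularized Sinkhorn iterations; `reg` is an
    illustrative regularization strength, not a value from the paper.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # learner marginal (uniform over steps)
    b = np.full(m, 1.0 / m)          # expert marginal (uniform over steps)
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def ot_rewards(learner_states, expert_states, reg=0.05):
    """Per-step proxy rewards from state-only trajectories (no actions)."""
    diff = learner_states[:, None, :] - expert_states[None, :, :]
    cost = np.linalg.norm(diff, axis=-1)        # pairwise Euclidean cost
    scale = max(cost.max(), 1e-12)              # normalize for numerical stability
    P = sinkhorn_coupling(cost / scale, reg)
    # Reward for step t: negative cost of moving learner state t's mass.
    return -(P * cost).sum(axis=1)
```

Because the reward depends only on states, rewards of this kind could in principle be handed to any off-the-shelf RL algorithm in place of the environment reward, which is the property the abstract emphasizes.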

Paper Structure (11 sections, 9 equations, 4 figures, 6 tables, 1 algorithm)


Figures (4)

  • Figure 1: Learning curves for a single expert demonstration across 5 random seeds. The shaded area represents one standard deviation. OOPS+TD3 consistently matches or outperforms the baseline approaches.
  • Figure 2: Calibration plot comparing the proxy reward with the original reward function of the benchmark domains. Each point is the sum of rewards under each reward function, averaged over 5 trajectories. Trajectories are generated by adding noise $\mathcal{N}(0,\ell^2)$ to the expert policy. The calibration plots show a strong correlation between the proxy reward and the true task reward.
  • Figure 3: Wasserstein distances between the 10 final rollout trajectories of OOPS+TD3 and the expert on the Hopper environment, using different solvers for the coupling matrix $P$ ($W_{\text{greedy}}$ and $W_{\text{simplex}}$) compared against the Sinkhorn distance $W_{\text{Sk}}$ when varying the parameter $\lambda$. Results are averaged over 10 expert trajectories. For low enough values of $\lambda$, the Sinkhorn distance gives a tighter upper bound on the Wasserstein distance than $W_{\text{greedy}}$ (PWIL). Results for the other environments can be found in the Appendix.
  • Figure 4: Wasserstein distances between the 10 final rollout trajectories of OOPS+TD3 and the expert, using different solvers for the coupling matrix $P$ ($W_{\text{greedy}}$ and $W_{\text{simplex}}$) compared against the Sinkhorn distance $W_{\text{Sk}}$ when varying the parameter $\lambda$. Results are averaged over 10 expert trajectories. For low enough values of $\lambda$, the Sinkhorn distance gives a tighter upper bound on the Wasserstein distance than $W_{\text{greedy}}$ (PWIL).
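The captions of Figures 3 and 4 compare couplings from different solvers. As a rough illustration of why a greedy coupling only upper-bounds the exact Wasserstein distance, here is a small sketch under simplifying assumptions: both trajectories have the same length with uniform weights, so exact optimal transport reduces to an assignment problem. It uses `scipy.optimize.linear_sum_assignment` as a stand-in for a simplex solver, and a simplified nearest-available-match rule in the spirit of a greedy coupling rather than PWIL's actual procedure. The function names `pairwise_cost`, `w_exact`, and `w_greedy` are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_cost(x, y):
    """Euclidean cost matrix between two state trajectories."""
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)

def w_exact(cost):
    """Exact OT cost for equal-length, uniformly weighted trajectories.

    Assumption: with uniform 1/n weights on both sides, an optimal
    coupling is a permutation, so an assignment solver suffices.
    """
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def w_greedy(cost):
    """Simplified greedy coupling: each learner state, in time order,
    takes the nearest expert state that has not been used yet."""
    n = cost.shape[0]
    available = np.ones(n, dtype=bool)
    total = 0.0
    for i in range(n):
        masked = np.where(available, cost[i], np.inf)
        j = int(np.argmin(masked))
        available[j] = False
        total += cost[i, j]
    return total / n
```

Any greedy matching is a feasible coupling, so `w_greedy(C)` can never fall below `w_exact(C)`; the gap between the two is the kind of looseness the figures measure.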