Imitation Learning from Observation through Optimal Transport

Wei-Di Chang; Scott Fujimoto; David Meger; Gregory Dudek

TL;DR

This paper re-examines optimal transport for imitation learning (IL), in which a reward is generated from the Wasserstein distance between the state trajectories of the learner and the expert, and shows that existing methods can be simplified to generate a reward function without requiring learned models or adversarial learning.

Abstract

Imitation Learning from Observation (ILfO) is a setting in which a learner tries to imitate the behavior of an expert using only observational data, without the direct guidance of demonstrated actions. In this paper, we re-examine optimal transport for imitation learning (IL), in which a reward is generated from the Wasserstein distance between the state trajectories of the learner and the expert. We show that existing methods can be simplified to generate a reward function without requiring learned models or adversarial learning. Unlike many other state-of-the-art methods, our approach can be integrated with any RL algorithm and is amenable to ILfO. We demonstrate the effectiveness of this simple approach on a variety of continuous control tasks and find that it surpasses the state of the art in the ILfO setting, achieving expert-level performance across a range of evaluation domains even when observing only a single expert trajectory without actions.
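To make the idea of a transport-based reward concrete, the following is a minimal sketch rather than the paper's implementation. It computes an entropy-regularized (Sinkhorn) coupling between learner and expert states and assigns each learner step the negative of its share of the transport cost. The function names, the squared Euclidean cost, the rescaling of the cost matrix, the uniform weights, and the `reg` value are all assumptions made for illustration. The paper's outline refers to a state transition distance, while this sketch uses individual states for simplicity.

```python
import numpy as np

def pairwise_sq_dist(x, y):
    """Squared Euclidean distance between every row of x and every row of y."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def sinkhorn_coupling(cost, reg=0.05, n_iters=500):
    """Entropy-regularized coupling between two uniform empirical measures."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform weights on learner states (assumption)
    b = np.full(m, 1.0 / m)          # uniform weights on expert states (assumption)
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):         # alternate scaling to match both marginals
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_rewards(learner_states, expert_states, reg=0.05):
    """Per-step proxy reward: negative transport cost charged to each learner state."""
    cost = pairwise_sq_dist(learner_states, expert_states)
    cost = cost / (cost.max() + 1e-12)   # rescale to [0, 1] for stability (assumption)
    coupling = sinkhorn_coupling(cost, reg)
    return -np.sum(coupling * cost, axis=1)
```

Because the rewards are computed from states alone, a sketch like this needs no expert actions, and the resulting per-step values could be handed to any off-the-shelf RL algorithm in place of the environment reward.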

Paper Structure (11 sections, 9 equations, 4 figures, 6 tables, 1 algorithm)

Introduction
Background
Related Work
Wasserstein Imitation Learning from Observational Demonstrations
Experiments
Results
Analysis and Ablations
Conclusion
Additional Results and Experiments
Comparing Solvers for the State Transition Wasserstein Distance
Experimental Details

Figures (4)

Figure 1: Learning curves for 1 expert demonstration across 5 random seeds. The shaded area represents one standard deviation. OOPS+TD3 consistently matches or outperforms the baseline approaches.
Figure 2: Calibration plot comparing the proxy reward with the original reward function of the benchmark domains. Each point represents the average of the sum of each reward function, over 5 trajectories. Trajectories are generated by adding noise $\mathcal{N}(0,\ell^2)$ to the expert policy. The calibration plots show a strong correlation between the proxy reward and the true task reward.
Figure 3: Wasserstein distances between the 10 final rollout trajectories of OOPS+TD3 and the expert on the Hopper environment, using different solvers for the coupling matrix $P$ ($W_{\text{greedy}}$ and $W_{\text{simplex}}$), compared against the Sinkhorn distance $W_{\text{Sk}}$ as the parameter $\lambda$ varies. Results are averaged over 10 expert trajectories. For low enough values of $\lambda$, the Sinkhorn distance gives a tighter upper bound on the Wasserstein distance than $W_{\text{greedy}}$ (PWIL). Results for the other environments can be found in the Appendix.
Figure 4: Wasserstein distances between the 10 final rollout trajectories of OOPS+TD3 and the expert, using different solvers for the coupling matrix $P$ ($W_{\text{greedy}}$ and $W_{\text{simplex}}$), compared against the Sinkhorn distance $W_{\text{Sk}}$ as the parameter $\lambda$ varies. Results are averaged over 10 expert trajectories. For low enough values of $\lambda$, the Sinkhorn distance gives a tighter upper bound on the Wasserstein distance than $W_{\text{greedy}}$ (PWIL).
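The solver comparison in the last two captions rests on the fact that any feasible coupling yields an upper bound on the Wasserstein distance, so a cheaper solver can only overestimate it. The sketch below illustrates this on toy data under stated assumptions. `greedy_coupling_cost` is a hypothetical paraphrase of a greedy, sequential matching scheme and is not taken from the paper or from PWIL's code. `exact_assignment_cost` is a brute-force stand-in for an exact solver and is only practical for very small, equal-length trajectories with uniform weights.

```python
import itertools
import numpy as np

def greedy_coupling_cost(cost):
    """Transport cost of a greedy coupling: each learner atom, in order,
    sends its mass to the closest expert atoms that still have mass left."""
    n, m = cost.shape
    expert_mass = np.full(m, 1.0 / m)
    total = 0.0
    for i in range(n):
        remaining = 1.0 / n
        while remaining > 1e-12:
            available = np.where(expert_mass > 1e-12, cost[i], np.inf)
            j = int(np.argmin(available))
            if not np.isfinite(available[j]):
                break                      # no expert mass left (rounding guard)
            moved = min(remaining, expert_mass[j])
            total += moved * cost[i, j]
            remaining -= moved
            expert_mass[j] -= moved
    return total

def exact_assignment_cost(cost):
    """Exact optimal transport cost for equal-size uniform measures,
    found by enumerating all one-to-one assignments."""
    n = cost.shape[0]
    best = min(
        sum(cost[i, perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best / n

# Toy trajectories: 6 states in 2 dimensions (illustrative data only).
rng = np.random.default_rng(0)
learner = rng.normal(size=(6, 2))
expert = rng.normal(size=(6, 2))
cost = np.linalg.norm(learner[:, None, :] - expert[None, :, :], axis=-1)

w_greedy = greedy_coupling_cost(cost)
w_exact = exact_assignment_cost(cost)
print(f"greedy: {w_greedy:.4f}  exact: {w_exact:.4f}")
```

The greedy value can never fall below the exact one, and the gap between the two is the looseness of the bound that the figures compare across solvers.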
