Wasserstein Adversarial Imitation Learning

Huang Xiao, Michael Herman, Joerg Wagner, Sebastian Ziesche, Jalal Etesami, Thai Hong Linh

TL;DR

This work links imitation learning and inverse reinforcement learning to optimal transport by interpreting Kantorovich potentials as rewards and employing regularized OT to obtain scalable, smooth reward functions. The resulting Wasserstein Adversarial Imitation Learning (WAIL) uses a neural-network parameterized reward and entropic OT to drive policy optimization, achieving strong sample efficiency and often requiring only a single expert demonstration. The method outperforms baselines like GAIL and behavioral cloning across a suite of robotic control tasks, with theoretical convergence guarantees under KL-constraint conditions. The approach offers a principled, transferable framework for imitation that leverages OT geometry to produce stable, interpretable rewards and scalable training dynamics.

Abstract

Imitation Learning describes the problem of recovering an expert policy from demonstrations. While inverse reinforcement learning approaches are known to be very sample-efficient in terms of expert demonstrations, they usually require problem-dependent reward functions or a (task-)specific reward-function regularization. In this paper, we show a natural connection between inverse reinforcement learning approaches and Optimal Transport that enables more general reward functions with desirable properties (e.g., smoothness). Based on this observation, we propose a novel approach called Wasserstein Adversarial Imitation Learning. Our approach treats the Kantorovich potentials as a reward function and further leverages regularized optimal transport to enable large-scale applications. In several robotic experiments, our approach outperforms the baselines in terms of average cumulative rewards and shows a significant improvement in sample efficiency, requiring just one expert demonstration.
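To make the idea of Kantorovich potentials under regularized optimal transport more concrete, here is a minimal sketch that computes entropic dual potentials between two small sample sets. It is not the paper's method: the abstract describes a learned reward function, whereas this sketch swaps in plain log-domain Sinkhorn iterations on discrete samples. The function names, the squared Euclidean cost, the uniform sample weights, the value of `eps`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def entropic_potentials(x, y, eps=1.0, n_iters=500):
    """Log-domain Sinkhorn for uniform empirical measures on x and y.

    Returns dual potentials f (one value per row of x), g (one value per
    row of y), and the cost matrix. Squared Euclidean cost is an assumption.
    """
    n, m = len(x), len(y)
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_mu = np.full(n, -np.log(n))  # uniform weights on x
    log_nu = np.full(m, -np.log(m))  # uniform weights on y
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iters):
        # Alternating dual updates; each one enforces one marginal constraint.
        f = -eps * _logsumexp((g[None, :] - cost) / eps + log_nu[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + log_mu[:, None], axis=0)
    return f, g, cost


def transport_plan(f, g, cost, eps):
    # Primal plan implied by the potentials, again with uniform weights.
    n, m = cost.shape
    return np.exp((f[:, None] + g[None, :] - cost) / eps) / (n * m)


# Hypothetical toy data standing in for expert and agent state-action samples.
rng = np.random.default_rng(0)
expert = rng.normal(size=(5, 2))
agent = rng.normal(loc=1.0, size=(6, 2))
f, g, cost = entropic_potentials(expert, agent, eps=1.0)
plan = transport_plan(f, g, cost, eps=1.0)
```

In this sketch `f` assigns one scalar to each expert sample and `g` one scalar to each agent sample, which is the sense in which a potential can be read as a per-sample signal. A smaller `eps` brings the solution closer to unregularized optimal transport but typically needs more iterations.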

Paper Structure

This paper contains 12 sections, 4 theorems, 19 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Proposition 2.1

(See Theorem 2 of Syed:2008:ALU:1390156.1390286.) The mapping $\rho \mapsto \pi_\rho$ defined by $\pi_\rho(a \mid s) := \rho(s, a) / \sum_{a' \in \mathcal{A}} \rho(s, a')$ is a bijection between $\Pi$ and the set $\mathcal{B}$ of measures on $\mathcal{S}\times\mathcal{A}$ satisfying the Bellman eq
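The mapping from an occupancy measure to a policy in Proposition 2.1 amounts to normalizing each state's row of $\rho$. A minimal sketch for a finite state and action space follows; the function name and the toy values of `rho` are hypothetical, and the code does not check that `rho` actually belongs to $\mathcal{B}$.

```python
import numpy as np


def policy_from_occupancy(rho):
    """Compute pi_rho(a | s) = rho(s, a) / sum_a' rho(s, a').

    rho is an array of shape (n_states, n_actions). Every state must carry
    positive mass, otherwise the conditional is undefined.
    """
    rho = np.asarray(rho, dtype=float)
    state_mass = rho.sum(axis=1, keepdims=True)
    if np.any(state_mass <= 0):
        raise ValueError("every state needs positive occupancy mass")
    return rho / state_mass


# Hypothetical occupancy measure over 2 states and 2 actions.
rho = np.array([[0.10, 0.30],
                [0.45, 0.15]])
pi = policy_from_occupancy(rho)
```

Each row of `pi` is a probability distribution over actions for one state.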

Figures (6)

  • Figure 1: Imitation performance of WAIL, GAIL and BC on $9$ control tasks with respect to different expert data sizes. The performance is the average cumulative reward over $50$ trajectories, scaled to $[0,1]$ with respect to expert and random policy performance.
  • Figure 2: Reward surfaces of WAIL and GAIL on Humanoid with respect to different expert data sizes.
  • Figure 3: Training curves of WAIL for all control tasks with respect to different expert data sizes.
  • Figure 4: Reward surfaces of WAIL and GAIL on $9$ control tasks with respect to different expert data sizes.
  • ...and 2 more figures
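The Figure 1 caption describes scaling performance with respect to expert and random policy performance. One common way to do this is a min-max normalization, sketched below. The exact formula is not given here, so this is an assumption, and the function name and the numbers in the usage lines are hypothetical.

```python
def scale_return(avg_return, random_return, expert_return):
    """Map a return so that the random policy scores 0 and the expert scores 1.

    Assumed normalization; values fall outside [0, 1] if the agent is worse
    than the random policy or better than the expert.
    """
    if expert_return == random_return:
        raise ValueError("expert and random returns must differ")
    return (avg_return - random_return) / (expert_return - random_return)


# Hypothetical average returns, not taken from the paper.
score = scale_return(avg_return=750.0, random_return=0.0, expert_return=1000.0)
```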

Theorems & Definitions (4)

  • Proposition 2.1
  • Proposition 3.1
  • Proposition 3.2
  • Theorem 4.1