Cross-Domain Imitation Learning via Optimal Transport

Arnaud Fickinger, Samuel Cohen, Stuart Russell, Brandon Amos

TL;DR

GWIL (Gromov-Wasserstein Imitation Learning) tackles cross-domain imitation learning by using the Gromov-Wasserstein distance $\mathcal{GW}$ to compare occupancy measures across incomparable state–action spaces. A proxy reward $r_{\mathcal{GW}}$ built from the optimal coupling trains the policy via RL, so no proxy tasks are needed. Theoretical results show that minimizing $\mathcal{GW}$ recovers an optimal policy up to an isometry under suitable metric and embedding conditions. Empirically, a single expert trajectory suffices to achieve near-optimal behavior in continuous control across rigid, mildly transformed, and highly transformed domains. Because the framework requires neither paired demonstrations nor proxy tasks, it broadens the applicability of imitation learning, with potential impact on transferring skills between humans and robots with different morphologies.
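
To make the idea of a coupling-based proxy reward concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the coupling is assumed to be given (in practice it would come from a Gromov-Wasserstein solver), the function name `gw_cost_and_rewards` is hypothetical, and the per-sample reward shown is one natural way to attribute the objective to individual agent samples, which may differ from the paper's exact definition of $r_{\mathcal{GW}}$.

```python
import numpy as np

def pairwise_dist(points):
    """Euclidean distance matrix within a single space."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gw_cost_and_rewards(d_expert, d_agent, coupling):
    """Evaluate the GW objective for a GIVEN coupling (no optimization here).

    d_expert: (n, n) distances between expert samples
    d_agent:  (m, m) distances between agent samples
    coupling: (n, m) joint weights between expert and agent samples
    Returns the objective value and one reward per agent sample (assumed form).
    """
    # distortion[i, j, k, l] = (d_expert[i, k] - d_agent[j, l]) ** 2
    distortion = (d_expert[:, None, :, None] - d_agent[None, :, None, :]) ** 2
    # per_pair[i, j] = sum over (k, l) of distortion[i, j, k, l] * coupling[k, l]
    per_pair = np.einsum("ijkl,kl->ij", distortion, coupling)
    weighted = per_pair * coupling
    cost = float(weighted.sum())
    # Share of the objective attributed to each agent sample, normalized by its mass.
    rewards = -weighted.sum(axis=0) / coupling.sum(axis=0)
    return cost, rewards

# Toy data: three expert samples in 2-D, and the same shape embedded in 3-D and shifted.
expert = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
agent = np.hstack([expert, np.zeros((3, 1))]) + np.array([5.0, -3.0, 2.0])
d_e, d_a = pairwise_dist(expert), pairwise_dist(agent)

matched = np.eye(3) / 3          # pairs each expert sample with its embedded copy
uniform = np.full((3, 3), 1 / 9) # independent coupling, ignores the geometry

matched_cost, matched_rewards = gw_cost_and_rewards(d_e, d_a, matched)
uniform_cost, uniform_rewards = gw_cost_and_rewards(d_e, d_a, uniform)
```

The two spaces have different dimensionality, yet the objective only ever touches within-space distance matrices, which is what lets it compare them. The matched coupling incurs no distortion, while the uniform coupling is penalized, and the rewards sum (weighted by agent mass) to the negative objective.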

Abstract

Cross-domain imitation learning studies how to leverage expert demonstrations of one agent to train an imitation agent with a different embodiment or morphology. Comparing trajectories and stationary distributions between the expert and imitation agents is challenging because they live on different systems that may not even have the same dimensionality. We propose Gromov-Wasserstein Imitation Learning (GWIL), a method for cross-domain imitation that uses the Gromov-Wasserstein distance to align and compare states between the different spaces of the agents. Our theory formally characterizes the scenarios where GWIL preserves optimality, revealing its possibilities and limitations. We demonstrate the effectiveness of GWIL in non-trivial continuous control domains ranging from simple rigid transformation of the expert domain to arbitrary transformation of the state-action space.

Paper Structure

This paper contains 14 sections, 3 theorems, 26 equations, 9 figures, 1 algorithm.

Key Result

Proposition 1

$\mathcal{GW}$ defines a metric on the collection of all isometry classes of policies.
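
The "isometry classes" in Proposition 1 can be illustrated numerically. The sketch below is an illustration under assumed toy data, not part of the paper: it applies a rotation, a reflection, and a translation to a random point cloud and checks that the Gromov-Wasserstein objective vanishes under the coupling that pairs each point with its image. Because the objective is non-negative, a coupling with zero cost is optimal, so the two clouds are at distance zero. The helper names `pairwise_dist` and `gw_cost` are hypothetical.

```python
import numpy as np

def pairwise_dist(points):
    """Euclidean distance matrix within a single space."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gw_cost(d_x, d_y, coupling):
    """GW objective for a fixed coupling: sum of squared distance mismatches."""
    distortion = (d_x[:, None, :, None] - d_y[None, :, None, :]) ** 2
    return float(np.einsum("ijkl,ij,kl->", distortion, coupling, coupling))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))

# Isometry: rotate by theta, reflect across the first axis, then translate.
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
reflect = np.diag([1.0, -1.0])
y = x @ rot.T @ reflect + np.array([3.0, -1.0])

identity_coupling = np.eye(5) / 5  # pairs point i with its transformed image

iso_cost = gw_cost(pairwise_dist(x), pairwise_dist(y), identity_coupling)
# Scaling is NOT an isometry: pairwise distances double, so this coupling is penalized.
scaled_cost = gw_cost(pairwise_dist(x), pairwise_dist(2.0 * x), identity_coupling)
```

Here `iso_cost` is zero up to floating-point error, whereas `scaled_cost` is strictly positive for this particular coupling. The second value is only the cost of one coupling, not the minimum over all couplings, so it bounds the distance from above rather than computing it.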

Figures (9)

  • Figure 1: The Gromov-Wasserstein distance enables us to compare the stationary state-action distributions of two agents with different dynamics and state-action spaces. We use it as a pseudo-reward for cross-domain imitation learning.
  • Figure 2: Isometric policies (Definition 2) have the same pairwise distances within the state-action space of the stationary distributions. In Euclidean spaces, isometric transformations preserve these pairwise distances and include rotations, translations, and reflections.
  • Figure 3: Given a single expert trajectory in the expert's domain (a), GWIL recovers an optimal policy in the agent's domain (b) without any external reward, as predicted by Theorem 1. The green dot represents the initial state position and the episode ends when the agent reaches the goal represented by the red square.
  • Figure 4: Given a single expert trajectory in the pendulum's domain (above), GWIL recovers the optimal behavior in the agent's domain (cartpole, below) without any external reward.
  • Figure 5: Given a single expert trajectory in the cheetah's domain (above), GWIL recovers the two elements of the optimal policy's isometry class in the agent's domain (walker): moving forward, which is optimal (middle), and moving backward, which is suboptimal (below). Interestingly, the resulting walker behaves like a cheetah.
  • ...and 4 more figures

Theorems & Definitions (11)

  • Definition 1: Gromov-Wasserstein distance between policies
  • Definition 2: Isometric policies
  • Proposition 1
  • proof
  • Theorem 1
  • proof
  • Remark 1
  • Definition 3
  • Proposition 2
  • proof
  • ...and 1 more
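
Definition 1 is listed above by name only. For orientation, the discrete Gromov-Wasserstein objective is commonly written in the form below. The notation is assumed here rather than taken from the paper: $\rho_\pi$ and $\rho_{\pi'}$ denote the occupancy measures of the two policies, $d_E$ and $d_A$ the metrics on the expert and agent state-action spaces, and $\mathcal{U}(\rho_\pi, \rho_{\pi'})$ the set of couplings with those marginals. The paper's exact statement may differ.

```latex
\mathcal{GW}^2(\pi, \pi')
  = \min_{u \in \mathcal{U}(\rho_\pi, \rho_{\pi'})}
    \sum_{x, x'} \sum_{y, y'}
    \left| d_E(x, x') - d_A(y, y') \right|^2 \, u_{x, y} \, u_{x', y'}
```

Each term compares a distance measured inside the expert space with a distance measured inside the agent space, so no cross-space metric is ever required.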