Align Your Intents: Offline Imitation Learning via Optimal Transport

Maksim Bobrin, Nazar Buzun, Dmitrii Krylov, Dmitry V. Dylov

TL;DR

AILOT outperforms state-of-the-art offline imitation learning algorithms on D4RL benchmarks and improves the performance of other offline RL algorithms through dense reward relabelling in sparse-reward tasks.

Abstract

Offline Reinforcement Learning (RL) addresses the problem of sequential decision-making by learning an optimal policy from pre-collected data, without interacting with the environment. So far, it has remained somewhat impractical, because one rarely knows the reward explicitly and it is hard to distill it retrospectively. Here, we show that an imitating agent can still learn the desired behavior merely from observing the expert, despite the absence of explicit rewards or action labels. In our method, AILOT (Aligned Imitation Learning via Optimal Transport), we use a special representation of states in the form of intents that incorporate pairwise spatial distances within the data. Given such representations, we define an intrinsic reward function via the optimal transport distance between the expert's and the agent's trajectories. We report that AILOT outperforms state-of-the-art offline imitation learning algorithms on D4RL benchmarks and improves the performance of other offline RL algorithms through dense reward relabelling in sparse-reward tasks.
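
The reward construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that intent embeddings for the agent's and the expert's trajectories are already available as arrays, uses a squared Euclidean cost between intents, approximates the optimal coupling with entropy-regularised Sinkhorn iterations, and applies an exponential squashing as one possible scaling of the per-state transport cost. The function names and the parameters `reg`, `n_iters`, `alpha` and `tau` are hypothetical and chosen for illustration only.

```python
import numpy as np


def sinkhorn_coupling(cost, reg=0.05, n_iters=200):
    """Approximate OT coupling between two uniform empirical measures.

    Entropy-regularised Sinkhorn iterations; `reg` and `n_iters` are
    illustrative defaults, not values taken from the paper.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # uniform weights on agent states
    b = np.full(m, 1.0 / m)      # uniform weights on expert states
    K = np.exp(-cost / reg)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale to match row marginals
        v = b / (K.T @ u)        # rescale to match column marginals
    return u[:, None] * K * v[None, :]


def ot_rewards(agent_intents, expert_intents, alpha=1.0, tau=1.0):
    """Per-state intrinsic rewards from an OT coupling over intents.

    Assumption: squared Euclidean cost and exponential squashing
    alpha * exp(-tau * cost); both are illustrative choices.
    """
    diff = agent_intents[:, None, :] - expert_intents[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)            # C_ij, shape (n, m)
    scale = cost.max() + 1e-8                    # rescale for numerical stability
    coupling = sinkhorn_coupling(cost / scale)   # P_ij, shape (n, m)
    per_state_cost = np.sum(coupling * cost, axis=1)  # sum_j P_ij C_ij
    return alpha * np.exp(-tau * per_state_cost)


rng = np.random.default_rng(0)
expert = rng.normal(size=(20, 4))                # stand-in for expert intents
agent = expert + 0.01 * rng.normal(size=(20, 4))  # stand-in for agent intents
print(ot_rewards(agent, expert).shape)           # (20,)
```

Each agent state receives one dense reward, so a reward-free or sparse-reward dataset can be relabelled state by state before being passed to an offline RL algorithm.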

Paper Structure

This paper contains 30 sections, 1 theorem, 14 equations, 5 figures, 9 tables, 1 algorithm.

Key Result

Proposition 1

Consider all possible pairs of current and goal states $(s, s_{+})$ with $z = \psi(s_+)$. Assume that the functions $\phi(s)$, $\psi(s_{+})$, $T(\psi(s_{+}))$ (the decomposition of $V(s, s_+, z)$ from eq. eq:icvf) and expression are upper-bounded by some constant value. Here $\psi_0$ is some intermediate value between $\psi(s)$ and $\psi(s_{+})$. Then from the convergence of the intent embedding $\psi(s) \to \ps

Figures (5)

  • Figure 1: AILOT: Aligned Imitation Learning via Optimal Transport. Principal diagram of the intent alignment of states in two different trajectories (in blue and red). Left: Stage I: projection into the intent space (denoted by $\psi$ in the text); the first $3$ principal components are shown. Middle: Stage II: computation of intrinsic rewards for offline RL, where $s_i^a$ and $s_j^e$ are the agent's and the expert's states, respectively, and $P$ is the optimal coupling matrix; the corresponding intrinsic reward $r(s_i^a)$ is a scaling transform of the sum $\sum_j P_{ij} C_{ij}$ with some cost function $C$ defined on pairs of intents. Right: squared norm of the state and intent differences vs. the steps count within the same trajectory; it demonstrates that the distance between intents is proportional to the total path length (steps count) between the states (AntMaze example is shown).
  • Figure 2: Top: Sample trajectory from the agent dataset, showcasing the backflip task. Bottom: The Hopper agent successfully imitates the backflip from observations via AILOT. Refer to the Supplementary material for the animations.
  • Figure 3: The squared norm of the state and intent differences as a function of the total steps count between the states in the same trajectory. The intent differences $||\psi(s_{t+k}) - \psi(s_t)||^2$ have a near-linear dependence on the steps count. The squared norm in the state space is not a monotone function of the steps count, which makes it less efficient for training an imitating agent, since it completely ignores global geometric dependencies between states in the dataset.
  • Figure 4: Principal diagram of AILOT approach for imitation learning.
  • Figure 5: Top: Sample trajectory of the HalfCheetah expert standing upright, which the agent should imitate. Bottom: The HalfCheetah agent successfully imitates standing upright from observations via AILOT.
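
The diagnostic plotted in Figure 3 can be reproduced for any sequence of vectors with a short sketch like the one below. It is an illustration under assumptions, not the authors' evaluation code: `mean_sq_diff_by_gap` is a hypothetical helper that averages $||x_{t+k} - x_t||^2$ over all $t$ for each gap $k$, and the circular trajectory is a toy stand-in for raw states, not data from the paper.

```python
import numpy as np


def mean_sq_diff_by_gap(embeddings, max_gap):
    """Mean of ||x_{t+k} - x_t||^2 over t, for each gap k = 1..max_gap.

    `embeddings` has shape (T, d) with T > max_gap; rows are consecutive
    states (or intents) of a single trajectory.
    """
    curve = np.empty(max_gap)
    for k in range(1, max_gap + 1):
        diff = embeddings[k:] - embeddings[:-k]   # all pairs k steps apart
        curve[k - 1] = np.mean(np.sum(diff ** 2, axis=-1))
    return curve


# Toy "state space" trajectory: a point moving around the unit circle
# with a period of 8 steps, so far-apart steps can be close in space.
t = np.arange(32)
circle = np.stack([np.cos(2 * np.pi * t / 8), np.sin(2 * np.pi * t / 8)], axis=1)
curve = mean_sq_diff_by_gap(circle, max_gap=8)
print(np.round(curve, 3))
```

On this toy trajectory the curve rises up to a gap of 4 steps and falls back to zero at a gap of 8, which is the kind of non-monotone behaviour the caption attributes to raw state distances. Applying the same helper to intent embeddings $\psi(s_t)$ would give the second curve of the figure.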

Theorems & Definitions (1)

  • Proposition 1