A Dual Approach to Imitation Learning from Observations with Offline Datasets

Harshit Sikchi; Caleb Chuck; Amy Zhang; Scott Niekum

TL;DR

DILO (Dual Imitation Learning from Observations) is an algorithm that leverages arbitrary suboptimal data to learn imitating policies without requiring expert actions. It scales gracefully to high-dimensional observations and demonstrates improved performance across the board.

Abstract

Demonstrations are an effective alternative to task specification for learning agents in settings where designing a reward function is difficult. However, demonstrating expert behavior in the action space of the agent becomes unwieldy when robots have complex, unintuitive morphologies. We consider the practical setting where an agent has a dataset of prior interactions with the environment and is provided with observation-only expert demonstrations. Typical learning-from-observations approaches have required learning either an inverse dynamics model or a discriminator as an intermediate step of training. Errors in these intermediate one-step models compound during downstream policy learning or deployment. We overcome these limitations by directly learning a multi-step utility function that quantifies how each action impacts the agent's divergence from the expert's visitation distribution. Using the principle of duality, we derive DILO (Dual Imitation Learning from Observations), an algorithm that can leverage arbitrary suboptimal data to learn imitating policies without requiring expert actions. DILO reduces the learning-from-observations problem to that of simply learning an actor and a critic, bearing similar complexity to vanilla offline RL. This allows DILO to scale gracefully to high-dimensional observations and to demonstrate improved performance across the board. Project page (code and videos): https://hari-sikchi.github.io/dilo/
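To make "divergence from the expert's visitation distribution" concrete, the following is a toy sketch and not the DILO algorithm. It computes the discounted visitation over state pairs (s, s') for a few policies on a tiny tabular chain and compares each to an expert. The 3-state chain, the helper names `pair_visitation` and `total_variation`, and the choice of total variation as the divergence are all assumptions made for illustration; the paper's own objective and divergence may differ.

```python
def pair_visitation(P, pi, mu0, gamma=0.9, horizon=200):
    """Discounted visitation over state pairs (s, s') under policy pi.

    P[s][a][s2] is a transition probability, pi[s][a] an action probability,
    and mu0[s] the initial state distribution. The sum is truncated at
    `horizon` steps and renormalized to sum to 1.
    """
    n = len(mu0)
    d = [[0.0] * n for _ in range(n)]
    mu, weight = list(mu0), 1.0
    for _ in range(horizon):
        nxt = [0.0] * n
        for s in range(n):
            for a, p_a in enumerate(pi[s]):
                for s2 in range(n):
                    flow = mu[s] * p_a * P[s][a][s2]
                    d[s][s2] += weight * flow   # discounted mass on (s, s2)
                    nxt[s2] += flow             # next-step state distribution
        mu, weight = nxt, weight * gamma
    total = sum(map(sum, d))
    return [[x / total for x in row] for row in d]


def total_variation(p, q):
    """Total variation distance between two pair-visitation tables."""
    return 0.5 * sum(abs(a - b) for rp, rq in zip(p, q) for a, b in zip(rp, rq))


# Hypothetical 3-state chain: action 0 stays, action 1 moves right (capped at state 2).
N = 3
P = [[[1.0 if s2 == (s if a == 0 else min(s + 1, N - 1)) else 0.0
       for s2 in range(N)] for a in range(2)] for s in range(N)]
mu0 = [1.0, 0.0, 0.0]

expert = [[0.0, 1.0]] * N   # always move right
mixed = [[0.5, 0.5]] * N    # coin flip between stay and move
stay = [[1.0, 0.0]] * N     # never moves

d_expert = pair_visitation(P, expert, mu0)
tv_expert = total_variation(d_expert, d_expert)
tv_mixed = total_variation(pair_visitation(P, mixed, mu0), d_expert)
tv_stay = total_variation(pair_visitation(P, stay, mu0), d_expert)
print(tv_expert, tv_mixed, tv_stay)
```

Note that the comparison uses only state pairs and never the expert's actions, which is the sense in which observation-only demonstrations suffice to define the matching objective. Policies that visit transitions the expert never makes score a larger divergence.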

Paper Structure (23 sections, 1 theorem, 26 equations, 11 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Dual Imitation Learning from Observations
LfO as $\{s,s'\}$ Joint Visitation Distribution Matching
DILO: Leveraging Action-free Offline Interactions for Imitating Expert Observations
Policy Extraction and Practical Algorithm
Experiments
Offline Imitation from Observation Benchmarking
Imitating from Expert Image Observations
Imitating from Human Trajectories for Robot Manipulation
Conclusion
Theory
Derivation for Action-free distribution matching
What does the utility function $V^*(s,s')$ represent?
...and 8 more sections

Key Result

Theorem 6.1

The dual problem to the primal occupancy matching objective (Equation eq:primal_dilo_f_mixture) is given by the DILO objective in Equation eq:dilo. Moreover, as strong duality holds from Slater's conditions the primal and dual share the same optimal solution $d^*$ for any offline transition distrib

Figures (11)

Figure 1: DILO Method Overview: Classical offline LfO methods require learning a discriminator or IDM prior to the RL or BC step, and so suffer from compounding errors during training or deployment, respectively. DILO directly learns the multi-step utility $V^*(s,s')$ of transitioning to the next state in minimizing cumulative divergence from the expert, avoiding the errors that arise from using learned intermediate models for subsequent optimization.
Figure 2: Side-by-side comparison of LfO methods on state-only imitation vs image-only imitation. DILO shows noticeable improvement over existing LfO methods without hyperparameter tuning. Columns denote different suboptimal datasets.
Figure 3: Real Robot Experiments: The table reports success rates as x/y (x successes in y trials) for different methods on a real-robot air-hockey setup. For the dynamic puck-hitting task, we evaluate the number of touches made in addition to hitting behavior, which returns the puck in the opposite direction.
Figure 4: Example of learned hitting behavior across algorithms: The puck's (red) color gradient shows its movement over time for Dynamic Puck Hitting.
Figure 5: Tasks: Left: Place object and avoid obstacles. Center: Stationary Puck Striking. Right: Dynamic Puck Hitting.
...and 6 more figures

Theorems & Definitions (1)

Theorem 6.1
