TRAIL: Near-Optimal Imitation Learning with Suboptimal Data

Mengjiao Yang, Sergey Levine, Ofir Nachum

TL;DR

TRAIL addresses the problem of using abundant suboptimal offline data to improve imitation learning when near-optimal expert data are scarce. It learns a factored transition model and a low-dimensional latent action space from the offline data, then performs behavioral cloning (BC) in the latent space by reparametrizing expert actions with the action representation φ, which makes BC more sample-efficient. The authors derive a bound on imitation error that decomposes into a transition representation error, an action decoding error, and a latent BC error, and show improved sample complexity under certain conditions; they instantiate TRAIL with either an energy-based (EBM) or a linear transition model. Empirical results on AntMaze, locomotion, and DeepMind Control Suite tasks show substantial improvements over vanilla BC and robustness to highly suboptimal offline data, often rivaling offline RL even though TRAIL uses no reward labels. This suggests that learning action representations from offline dynamics is a productive alternative for offline sequential decision making.

Abstract

The aim in imitation learning is to learn effective policies by utilizing near-optimal expert demonstrations. However, high-quality demonstrations from human experts can be expensive to obtain in large numbers. On the other hand, it is often much easier to obtain large quantities of suboptimal or task-agnostic trajectories, which are not useful for direct imitation, but can nevertheless provide insight into the dynamical structure of the environment, showing what could be done in the environment even if not what should be done. We ask the question, is it possible to utilize such suboptimal offline datasets to facilitate provably improved downstream imitation learning? In this work, we answer this question affirmatively and present training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space. Our theoretical analysis shows that the learned latent action space can boost the sample-efficiency of downstream imitation learning, effectively reducing the need for large near-optimal expert datasets through the use of auxiliary non-expert data. To learn the latent action space in practice, we propose TRAIL (Transition-Reparametrized Actions for Imitation Learning), an algorithm that learns an energy-based transition model contrastively, and uses the transition model to reparametrize the action space for sample-efficient imitation learning. We evaluate the practicality of our objective through experiments on a set of navigation and locomotion tasks. Our results verify the benefits suggested by our theory and show that TRAIL is able to improve baseline imitation learning by up to 4x in performance.
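The contrastive training of an energy-based transition model mentioned in the abstract can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch and not the authors' implementation: the linear embeddings `W_phi` and `W_psi`, the inner-product energy, the use of in-batch negatives, and all dimensions are assumptions made for illustration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_transition_loss(phi_sa, psi_next):
    """InfoNCE-style loss over a batch of transitions.

    phi_sa:   (B, d) latent actions phi(s, a)
    psi_next: (B, d) embeddings of the observed next states s'
    Row i treats psi_next[i] as the positive and every other row as a negative.
    """
    logits = phi_sa @ psi_next.T                     # (B, B) pairwise scores
    return -np.mean(np.diag(log_softmax(logits, axis=1)))

# Toy batch standing in for the offline data (random, hypothetical dimensions).
rng = np.random.default_rng(0)
B, state_dim, action_dim, latent_dim = 8, 4, 2, 3
s = rng.normal(size=(B, state_dim))
a = rng.normal(size=(B, action_dim))
s_next = rng.normal(size=(B, state_dim))

# Assumed linear embeddings; a practical model would use neural networks.
W_phi = rng.normal(size=(state_dim + action_dim, latent_dim))
W_psi = rng.normal(size=(state_dim, latent_dim))

phi_sa = np.concatenate([s, a], axis=1) @ W_phi      # latent actions
psi_next = s_next @ W_psi                            # next-state embeddings
loss = contrastive_transition_loss(phi_sa, psi_next)
print(f"contrastive loss on a random batch: {loss:.3f}")
```

Minimizing such a loss pushes `phi(s, a)` to retain only the information about the action that helps predict the next state, which is the property the latent action space relies on.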

Paper Structure

This paper contains 38 sections, 12 theorems, 49 equations, 6 figures, 1 table.

Key Result

Theorem 1

Consider an action representation function $\phi:S\times A\to Z$, a factored transition model $\mathcal{T}_Z:S\times Z\to\Delta(S)$, an action decoder $\pi_{\alpha}:S\times Z\to\Delta(A)$, and a tabular latent policy $\pi_{Z}:S\to\Delta(Z)$. Define the transition representation error, the action decoding error, and the latent behavioral cloning error. Then the TV divergence between the st…

Figures (6)

  • Figure 1: The TRAIL framework. Pretraining learns a factored transition model $\mathcal{T}_Z\circ\phi$ and an action decoder $\pi_{\alpha}$ on $\mathcal{D}^\mathrm{off}$. Downstream imitation learns a latent policy $\pi_{Z}$ on $\mathcal{D}^{\pi_*}$ with expert actions reparametrized by $\phi$. During inference, $\pi_{Z}$ and $\pi_{\alpha}$ are combined to sample an action.
  • Figure 2: Tasks for our empirical evaluation. We include the challenging AntMaze navigation tasks from D4RL fu2020d4rl and low-dimensional (1-DoF) to high-dimensional (21-DoF) locomotion tasks from DeepMind Control Suite tassa2018deepmind.
  • Figure 3: Average success rate ($\%$) over $4$ seeds of TRAIL EBM (Theorem \ref{thm:tabular}) and temporal skill extraction methods -- SkiLD pertsch2021guided, SPiRL pertsch2020accelerating, and OPAL ajay2020opal -- pretrained on suboptimal $\mathcal{D}^\mathrm{off}$. Baseline BC corresponds to direct behavioral cloning of expert $\mathcal{D}^{\pi_*}$ without latent actions.
  • Figure 4: Average rewards (over $4$ seeds) of TRAIL EBM (Theorem \ref{thm:tabular}), TRAIL linear (Theorem \ref{thm:linear}), and baseline methods when using a variety of unimodal (ant-medium), low-quality (ant-medium-replay), and random (ant-random) offline datasets $\mathcal{D}^\mathrm{off}$ paired with a smaller expert dataset $\mathcal{D}^{\pi_*}$ (either $10$k or $25$k expert transitions).
  • Figure 5: Average task rewards (over $4$ seeds) of TRAIL EBM (Theorem \ref{thm:tabular}), TRAIL linear (Theorem \ref{thm:linear}), and OPAL (other temporal methods are included in Appendix \ref{app:results}) pretrained on the bottom $80\%$ of the RL Unplugged datasets followed by behavioral cloning in the latent action space on $\frac{1}{10}$ of the top $20\%$ of the RL Unplugged datasets following the setup in zolna2020offline. Baseline BC achieves low rewards due to the small expert sample size. Dotted lines denote the performance of CRR wang2020critic, an offline RL method trained on the full RL Unplugged datasets with reward labels.
  • ...and 1 more figure
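
The downstream and inference stages described in the Figure 1 caption can be sketched in a small tabular setting. This is an illustrative sketch under stated assumptions, not the authors' code: states, actions, and latent actions are discrete; the representation `phi` is a fixed hypothetical lookup table rather than a learned model; the decoder `pi_alpha` is built directly from `phi` instead of being trained on the offline data; and `fit_latent_policy` and `act` are hypothetical helper names.

```python
import numpy as np

def fit_latent_policy(states, latents, n_states, n_latents, smoothing=1.0):
    """Tabular latent BC: smoothed counts of (state, latent action) pairs."""
    counts = np.full((n_states, n_latents), smoothing)
    np.add.at(counts, (states, latents), 1.0)
    return counts / counts.sum(axis=1, keepdims=True)

def act(state, pi_z, pi_alpha, rng):
    """Inference: sample z from pi_Z(. | s), then decode a from pi_alpha(. | s, z)."""
    z = rng.choice(pi_z.shape[1], p=pi_z[state])
    a = rng.choice(pi_alpha.shape[2], p=pi_alpha[state, z])
    return a, z

n_states, n_latents, n_actions = 3, 2, 4

# Assumed representation: actions {0, 1} map to z = 0 and {2, 3} to z = 1.
phi = np.tile(np.arange(n_actions) // 2, (n_states, 1))          # (S, A)

# Assumed decoder: uniform over the actions that map to each latent action.
mask = phi[:, None, :] == np.arange(n_latents)[None, :, None]    # (S, Z, A)
pi_alpha = mask / mask.sum(axis=2, keepdims=True)

# Small expert dataset, with expert actions reparametrized by phi.
expert_states = np.array([0, 0, 1, 2])
expert_actions = np.array([0, 1, 3, 2])
expert_latents = phi[expert_states, expert_actions]

pi_z = fit_latent_policy(expert_states, expert_latents, n_states, n_latents)

rng = np.random.default_rng(0)
action, z = act(0, pi_z, pi_alpha, rng)
print("latent policy at state 0:", pi_z[0])
print("sampled latent action:", z, "decoded action:", action)
```

Because the latent policy is fit over `n_latents` choices per state rather than `n_actions`, fewer expert samples are needed to estimate it in this toy setting, which mirrors the motivation for performing behavioral cloning in the latent action space.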

Theorems & Definitions (23)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Lemma 6
  • proof
  • Lemma 7
  • ...and 13 more