Sample-efficient Adversarial Imitation Learning

Dahuin Jung, Hyungyu Lee, Sungroh Yoon

TL;DR

This work tackles sample inefficiency in imitation learning by coupling self-supervised representation learning with adversarial imitation. It learns temporally predictive state/action representations and employs a novel swapping corruption to generate diverse, in-distribution distortions, all integrated into a GAIL-style objective. Theoretical motivation connects reduced VC dimension and MI-based contrastive learning to improved generalization, and extensive experiments on MuJoCo and Atari RAM show strong gains, including a $39\%$ relative improvement with $N_E=100$ expert pairs. The approach scales to imperfect demonstrations and discrete control, though it increases model complexity and compute requirements. Overall, the method demonstrates substantial advances in sample-efficient imitation with robust representations and principled auxiliary tasks.

Abstract

Imitation learning, in which learning is performed by demonstration, has been studied and advanced for sequential decision-making tasks in which a reward function is not predefined. However, imitation learning methods still require numerous expert demonstration samples to successfully imitate an expert's behavior. To improve sample efficiency, we utilize self-supervised representation learning, which can generate vast training signals from the given data. In this study, we propose a self-supervised representation-based adversarial imitation learning method to learn state and action representations that are robust to diverse distortions and temporally predictive, on non-image control tasks. In particular, in comparison with existing self-supervised learning methods for tabular data, we propose a different corruption method for state and action representations that is robust to diverse distortions. We theoretically and empirically observe that making an informative feature manifold with less sample complexity significantly improves the performance of imitation learning. The proposed method shows a 39% relative improvement over existing adversarial imitation learning methods on MuJoCo in a setting limited to 100 expert state-action pairs. Moreover, we conduct comprehensive ablations and additional experiments using demonstrations with varying optimality to provide insights into a range of factors.
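The abstract does not spell out the corruption method, so the following is only a minimal sketch of one plausible reading of a "swapping" corruption for tabular state or action vectors: a random subset of entries is replaced by the value of the same feature taken from another row of the batch, so every corrupted value still comes from that feature's empirical distribution. The function name `swap_corrupt`, the `swap_rate` parameter, and the choice of donor rows are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def swap_corrupt(batch, swap_rate=0.3, rng=None):
    """Hypothetical swapping corruption for a (n_samples, n_features) batch.

    Each entry is, with probability `swap_rate`, replaced by the value of the
    same feature from a randomly chosen row of the batch.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = np.asarray(batch, dtype=float)
    n, d = batch.shape
    mask = rng.random((n, d)) < swap_rate        # entries selected for corruption
    donors = rng.integers(0, n, size=(n, d))     # donor row index for every entry
    donor_values = batch[donors, np.arange(d)]   # same column, donor row
    return np.where(mask, donor_values, batch), mask


# Usage on a toy batch of 8 "states" with 5 features each (assumed sizes).
rng = np.random.default_rng(0)
states = rng.normal(size=(8, 5))
corrupted, mask = swap_corrupt(states, swap_rate=0.3, rng=rng)
```

Because donor values are drawn from the batch itself, the corrupted vectors stay within the per-feature range of the data, which is one way to read the "in-distribution distortions" described in the TL;DR.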

Paper Structure (27 sections, 4 theorems, 22 equations, 2 figures, 14 tables, 3 algorithms)


Key Result

Corollary 1

$h_{\text{MSE}}$ is effective in reducing the VC dimension; however, the upper bound of $R(f|\mathcal{P})$ that we seek is based on $\hat{R}(f|\mathcal{P})$ instead of $\hat{R}(f_{h_{\text{MSE}}}|\mathcal{P})$. As a result, $f_{h_{\text{MSE}}}$ cannot capture all the properties pre

Figures (2)

  • Figure 1: Overview of the proposed model. Our proposed model comprises six networks during IRL training. 1) The policy $\pi_{\theta}$ generates actions $a$ based on states $s$; 2) The value function evaluates the current policy $\pi_{\theta}$. It is trained using rewards $r$ derived from an estimated cost function (discriminator); 3) The state encoder extracts a feature representation of raw states $s$; 4) The action encoder extracts a feature representation of actions $a$; 5) The forward dynamics model predicts the feature representation of the distorted version of the next state, $\hat{z}^s_{t+1}$, based on the feature representations of the current state and action, $z^s_t$ and $z^a_t$, along with Gaussian noise $\mathcal{N}$; and 6) The discriminator discriminates agent demonstrations from expert demonstrations. Its input is $z^s \oplus z^a$. The discriminator is also referred to as a cost function. More details of each component and loss in the figure are described in Section 4.
  • Figure 2: Final performance on five continuous control benchmarks with different optimality rates $\psi$. Vertical axes denote cumulative rewards acquired during the last 1000 training iterations. Shaded regions denote standard errors over three runs. Ours* = Ours + MM
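The data flow described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch only: the layer sizes, the single linear maps with `tanh`, the reading of $\oplus$ as concatenation, and the way Gaussian noise is fed to the forward dynamics model are all assumptions, and the policy and value networks are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
STATE_DIM, ACTION_DIM, FEAT_DIM, NOISE_DIM = 11, 3, 16, 4


def linear(in_dim, out_dim):
    """Return a random linear map standing in for a trained network."""
    weights = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: x @ weights


state_encoder = linear(STATE_DIM, FEAT_DIM)                  # s -> z^s
action_encoder = linear(ACTION_DIM, FEAT_DIM)                # a -> z^a
forward_model = linear(2 * FEAT_DIM + NOISE_DIM, FEAT_DIM)   # (z^s_t, z^a_t, noise) -> z^s_{t+1} prediction
discriminator_head = linear(2 * FEAT_DIM, 1)                 # (z^s, z^a) -> logit


def forward_pass(s, a):
    z_s = np.tanh(state_encoder(s))
    z_a = np.tanh(action_encoder(a))
    noise = rng.normal(size=(s.shape[0], NOISE_DIM))         # Gaussian noise input (assumed form)
    z_next_pred = forward_model(np.concatenate([z_s, z_a, noise], axis=1))
    logits = discriminator_head(np.concatenate([z_s, z_a], axis=1))
    d_out = 1.0 / (1.0 + np.exp(-logits))                    # discriminator output in (0, 1)
    return z_next_pred, d_out


# Usage with a toy batch of 4 state-action pairs.
s = rng.normal(size=(4, STATE_DIM))
a = rng.normal(size=(4, ACTION_DIM))
z_next_pred, d_out = forward_pass(s, a)
```

The point of the sketch is that the discriminator operates on encoded features rather than raw states and actions, so the same encoders serve both the adversarial objective and the forward-prediction auxiliary task.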

Theorems & Definitions (10)

  • Definition 1
  • Corollary 1
  • Definition 2
  • Claim 1
  • Theorem 1
  • Corollary 1
  • Definition 1
  • Definition 2
  • Theorem 2
  • Claim 1