Sample-Efficient Imitation Learning via Generative Adversarial Nets

Lionel Blondé, Alexandros Kalousis

TL;DR

The paper addresses the high sample complexity of Generative Adversarial Imitation Learning (GAIL) by introducing Sam, a Sample-efficient Adversarial Imitation Learning framework. Sam uses an off-policy, TD-based approach with deterministic policies and a triad of interacting modules: a discriminator-based reward, a critic, and a policy, all trained with a replay buffer to reuse past experience. By combining a gradient from the learned reward with TD-based policy evaluation and a carefully designed exploration strategy, Sam substantially reduces the number of environment interactions needed to achieve expert-like performance while maintaining stability. This matters for real-world robotics and other domains where environment interactions are costly or risky. The method remains model-free and keeps the adversarial formulation that GAIL is built on.

Abstract

GAIL is a recent successful imitation learning architecture that exploits the adversarial training procedure introduced in GANs. Albeit successful at generating behaviours similar to those demonstrated to the agent, GAIL suffers from a high sample complexity in the number of interactions it has to carry out in the environment in order to achieve satisfactory performance. We dramatically shrink the amount of interactions with the environment necessary to learn well-behaved imitation policies, by up to several orders of magnitude. Our framework, operating in the model-free regime, exhibits a significant increase in sample-efficiency over previous methods by simultaneously a) learning a self-tuned adversarially-trained surrogate reward and b) leveraging an off-policy actor-critic architecture. We show that our approach is simple to implement and that the learned agents remain remarkably stable, as shown in our experiments that span a variety of continuous control tasks. Video visualisations available at: https://youtu.be/-nCsqUJnRKU
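
To make the two ingredients named in the abstract more concrete, here is a minimal Python sketch of how a replay buffer, a discriminator-derived surrogate reward, and a one-step TD target could fit together. It is an illustration under stated assumptions, not the authors' implementation: the reward form `-log(1 - D)` is a common GAIL-style choice that this page does not confirm, and `ReplayBuffer`, `surrogate_reward`, `td_target`, and the placeholder discriminator and critic outputs are hypothetical names and values introduced only for this example.

```python
import numpy as np
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of past transitions (hypothetical helper).

    No environment reward is stored: in this sketch the reward is
    recomputed from the discriminator whenever a batch is sampled.
    """

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)
        self.rng = np.random.default_rng(seed)

    def add(self, state, action, next_state, done):
        self.storage.append((state, action, next_state, float(done)))

    def __len__(self):
        return len(self.storage)

    def sample(self, batch_size):
        # Uniform sampling with replacement over stored transitions.
        idx = self.rng.integers(0, len(self.storage), size=batch_size)
        batch = [self.storage[int(i)] for i in idx]
        states, actions, next_states, dones = map(np.array, zip(*batch))
        return states, actions, next_states, dones


def surrogate_reward(d_prob, eps=1e-8):
    """Assumed reward form: grows as the discriminator output approaches 1."""
    return -np.log(1.0 - d_prob + eps)


def td_target(reward, q_next, done, gamma=0.99):
    """One-step TD target; bootstrapping is switched off at terminal states."""
    return reward + gamma * (1.0 - done) * q_next


# Usage with random stand-in data (not real environment transitions).
buffer = ReplayBuffer(capacity=1000)
demo_rng = np.random.default_rng(1)
for _ in range(50):
    buffer.add(demo_rng.normal(size=3), demo_rng.normal(size=1),
               demo_rng.normal(size=3), False)

states, actions, next_states, dones = buffer.sample(8)
d_prob = np.full(8, 0.5)   # placeholder for discriminator outputs D(s, a)
q_next = np.zeros(8)       # placeholder for critic outputs at the next state
targets = td_target(surrogate_reward(d_prob), q_next, dones)
```

Recomputing the reward at sampling time, rather than storing it, is one way to keep old transitions consistent with a reward module that keeps changing during training. Whether Sam does exactly this is not stated on this page.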

Paper Structure

This paper contains 14 sections, 6 equations, 2 figures, 1 algorithm.

Figures (2)

  • Figure 1: Inter-module relationships in different neural architectures (the scope of this figure was inspired by Pfau2016-ft). Modules with distinct loss functions are depicted with empty circles, while filled circles designate environmental entities. Solid and dotted arrows respectively represent (forward) flow of information and (backward) flow of gradient. a) Generative Adversarial Imitation Learning Ho2016-bv b) Actor-Critic architecture Sutton1999-ii c) Sam (this work). Note that in Sam, the critic takes in information from the reward module, while in the vanilla AC architecture, the critic receives the reward from the environment. The gradient flow from the critic to the reward module must however be sealed. Indeed, such a gradient flow would allow the policy to adjust its parameters to induce values of the reward which yield low TD residuals, hence preventing both critic and reward modules from being learned as intended.
  • Figure 2: Performance comparison between Sam and GAIL in terms of episodic return. The horizontal axis depicts, in logarithmic scale, the number of interactions with the environment. While there is no ambiguity for GAIL, we used the unperturbed Sam policy $\mu_\theta$ (without parameter noise and additive action noise) to collect those returns during a per-iteration evaluation phase. The figure shows that our method has a considerably better sample-efficiency than GAIL in various continuous control tasks, often by several orders of magnitude. Red-colored lines and filled areas indicate the performance range of the expert demonstrations present in the training set. The meaning of the different line styles and colors is given in-text.
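
The Figure 1 caption requires that no gradient flow from the critic back into the reward module. The sketch below illustrates that idea with a linear critic and hand-written gradients in NumPy. Every modelling choice here is an assumption made for illustration: the linear critic `w @ x`, the logistic discriminator with parameters `v`, the reward form `-log(1 - D)`, and the names `critic_step` and `leaked_reward_gradient` do not come from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def reward_from_discriminator(v, x):
    """Assumed reward -log(1 - sigmoid(v @ x)), written as a stable softplus."""
    return np.logaddexp(0.0, v @ x)


def critic_step(w, v, x, x_next, gamma=0.99, lr=0.1):
    """One TD update of a linear critic Q(x) = w @ x.

    The reward and the bootstrapped target are treated as constants, so
    only the critic parameters w are updated. The reward parameters v are
    read but never modified: the gradient path is "sealed".
    """
    reward = reward_from_discriminator(v, x)
    target = reward + gamma * (w @ x_next)
    delta = (w @ x) - target          # TD residual
    grad_w = delta * x                # gradient of 0.5 * delta**2 w.r.t. w
    return w - lr * grad_w, delta


def leaked_reward_gradient(w, v, x, x_next, gamma=0.99):
    """Gradient of 0.5 * delta**2 w.r.t. v if the path were NOT sealed.

    Shown only to make explicit what is being discarded: following it
    would move the reward toward values that shrink the TD residual.
    """
    reward = reward_from_discriminator(v, x)
    delta = (w @ x) - (reward + gamma * (w @ x_next))
    return -delta * sigmoid(v @ x) * x


# Usage on a single hand-picked feature vector.
w = np.array([0.5, 0.0, 0.5])
v = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])
x_next = np.array([0.0, 1.0, 0.0])

v_before = v.copy()
w_new, delta_before = critic_step(w, v, x, x_next)
_, delta_after = critic_step(w_new, v, x, x_next)
leak = leaked_reward_gradient(w, v, x, x_next)
```

In an automatic-differentiation framework the same effect is usually obtained by detaching the reward before it enters the critic loss. The manual version above only makes the discarded term visible.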