Auto-Encoding Adversarial Imitation Learning

Kaifeng Zhang, Rui Zhao, Ziming Zhang, Yang Gao

TL;DR

AEAIL tackles the reliance on environment rewards in reinforcement learning by replacing the discriminator-based reward in adversarial imitation learning with a reconstruction-error reward from an auto-encoder. This surrogate enables denser, more informative feedback and yields robustness to noisy expert demonstrations, while its theoretical grounding ties the objective to the Wasserstein distance between expert and learned policies. Empirically, AEAIL surpasses state-of-the-art baselines on state- and image-based tasks and maintains strong performance across diverse auto-encoder types and distance divergences. The approach offers a simple, scalable, and robust pathway for imitation learning in high-dimensional settings with imperfect demonstrations.

Abstract

Reinforcement learning (RL) provides a powerful framework for decision-making, but its application in practice often requires a carefully designed reward function. Adversarial Imitation Learning (AIL) sheds light on automatic policy acquisition without access to the reward signal from the environment. In this work, we propose Auto-Encoding Adversarial Imitation Learning (AEAIL), a robust and scalable AIL framework. To induce expert policies from demonstrations, AEAIL uses the reconstruction error of an auto-encoder as a reward signal, which provides more information for optimizing policies than prior discriminator-based rewards. We then use the derived objective functions to train the auto-encoder and the agent policy. Experiments show that AEAIL outperforms state-of-the-art methods in both state-based and image-based environments. More importantly, AEAIL is much more robust when the expert demonstrations are noisy.

Paper Structure (26 sections, 5 theorems, 18 equations, 8 figures, 4 tables, 1 algorithm)

This paper contains 26 sections, 5 theorems, 18 equations, 8 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 4.1

If $f(x)$ is $K$-Lipschitz, then the reward function $1/(1+(x-f(x))^2)$ is $2m(K+1)$-Lipschitz, where $m$ is a bounded constant.
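
To make the statement concrete, the following is a minimal scalar sketch of the reward form in Theorem 4.1, together with an empirical slope check. It is not the paper's proof or implementation. The linear map `f(x) = k * x` is a hypothetical stand-in for an auto-encoder, and the helper names are illustrative. The bound used here comes from an elementary composition argument (an assumption of this sketch, not taken from the paper): $x - f(x)$ is $(K+1)$-Lipschitz, and $u \mapsto 1/(1+u^2)$ has slope at most $3\sqrt{3}/8$ in absolute value, so the reward is at most $\tfrac{3\sqrt{3}}{8}(K+1)$-Lipschitz in this scalar setting. How this constant relates to the theorem's $m$ is not stated in the text above.

```python
import math
import random


def reconstruction_reward(x, f):
    """Reward 1 / (1 + (x - f(x))^2) for a scalar input x and reconstruction map f."""
    err = x - f(x)
    return 1.0 / (1.0 + err * err)


def make_linear_autoencoder(k):
    """Hypothetical stand-in for an auto-encoder: f(x) = k * x, which is |k|-Lipschitz."""
    return lambda x: k * x


def max_empirical_slope(f, n_pairs=10_000, lo=-5.0, hi=5.0, seed=0):
    """Largest observed |r(a) - r(b)| / |a - b| over randomly sampled pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_pairs):
        a, b = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if a == b:
            continue
        diff = abs(reconstruction_reward(a, f) - reconstruction_reward(b, f))
        worst = max(worst, diff / abs(a - b))
    return worst


K = 0.5  # Lipschitz constant of the stand-in reconstruction map
f = make_linear_autoencoder(K)

# sup_u |d/du 1/(1+u^2)| = 3*sqrt(3)/8, attained at u = 1/sqrt(3)
OUTER_LIPSCHITZ = 3.0 * math.sqrt(3.0) / 8.0
bound = OUTER_LIPSCHITZ * (K + 1.0)
observed = max_empirical_slope(f)

print(f"observed max slope: {observed:.4f}, composition bound: {bound:.4f}")
```

A perfect reconstruction gives the maximum reward of 1, and the reward decays smoothly toward 0 as the squared error grows, so the reward varies continuously with reconstruction quality.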

Figures (8)

  • Figure 1: Training framework of AEAIL. The auto-encoder computes the reconstruction error for two mini-batches of data examples and optimizes the derived objectives. The resulting surrogate reward function $r_w(\delta)$ provides the learning signal to the agent. The agent induces the expert policy by iteratively training the auto-encoder and the agent policy.
  • Figure 2: Mean and standard deviation of the final policy's return over 10 rollouts and 3 seeds.
  • Figure 3: Mean and standard deviation of the return of the deterministic evaluation policy over 10 rollouts and 5 seeds when learning from clean expert demonstrations, reported every 100k timesteps.
  • Figure 4: Mean and standard deviation of the return of the deterministic evaluation policy over 10 rollouts and 5 seeds when learning from noisy expert demonstrations, reported every 100k timesteps.
  • Figure 5: Relative improvement of our AEAIL compared to JSD with discriminator-based rewards in different environments. Environments are sorted by the number of state-action dimensions.
  • ...and 3 more figures

Theorems & Definitions (5)

  • Theorem 4.1
  • Corollary 4.2
  • Corollary 4.3
  • Theorem 7.1
  • Corollary 7.2