Provably and Practically Efficient Adversarial Imitation Learning with General Function Approximation

Tian Xu, Zhilong Zhang, Ruishuo Chen, Yihao Sun, Yang Yu

TL;DR

This paper introduces optimization-based AIL (OPT-AIL), which performs online optimization for reward functions and optimism-regularized Bellman error minimization for Q-value functions. It proves that OPT-AIL achieves polynomial expert sample complexity and interaction complexity, which makes it, to the authors' knowledge, the first provably efficient AIL method with general function approximation.

Abstract

As a prominent category of imitation learning methods, adversarial imitation learning (AIL) has garnered significant practical success powered by neural network approximation. However, existing theoretical studies on AIL are primarily limited to simplified scenarios such as tabular and linear function approximation and involve complex algorithmic designs that hinder practical implementation, highlighting a gap between theory and practice. In this paper, we explore the theoretical underpinnings of online AIL with general function approximation. We introduce a new method called optimization-based AIL (OPT-AIL), which centers on performing online optimization for reward functions and optimism-regularized Bellman error minimization for Q-value functions. Theoretically, we prove that OPT-AIL achieves polynomial expert sample complexity and interaction complexity for learning near-expert policies. To the best of our knowledge, OPT-AIL is the first provably efficient AIL method with general function approximation. Practically, OPT-AIL only requires the approximate optimization of two objectives, thereby facilitating practical implementation. Empirical studies demonstrate that OPT-AIL outperforms previous state-of-the-art deep AIL methods in several challenging tasks.
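The two objectives named in the abstract can be illustrated with a small tabular toy. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the discounted setting with `gamma`, the clipping of rewards to [0, 1], the specific reward loss (mean learner reward minus mean expert reward), and the form of the optimism term (a bonus `lam` on the initial-state value) are all hypothetical choices made for illustration.

```python
import numpy as np


def reward_ogd_step(r, learner_sa, expert_sa, lr=0.1):
    """One projected online-gradient step on a tabular reward r[s, a].

    Assumed loss (illustrative): mean reward on learner samples minus mean
    reward on expert samples. A step therefore lowers the reward where the
    learner goes and raises it where the expert goes.
    """
    grad = np.zeros_like(r)
    for s, a in learner_sa:
        grad[s, a] += 1.0 / len(learner_sa)
    for s, a in expert_sa:
        grad[s, a] -= 1.0 / len(expert_sa)
    # Projection step: keep rewards inside the assumed range [0, 1].
    return np.clip(r - lr * grad, 0.0, 1.0)


def optimism_regularized_loss(q, r, transitions, s0, gamma=0.99, lam=0.1):
    """Mean squared Bellman error minus an optimism bonus on the start state.

    transitions is a list of (s, a, s_next) index triples; q and r are
    arrays indexed as [state, action]. Minimizing this loss trades off
    Bellman consistency against a larger estimated value at s0.
    """
    errors = [
        (q[s, a] - (r[s, a] + gamma * q[s_next].max())) ** 2
        for s, a, s_next in transitions
    ]
    return float(np.mean(errors)) - lam * float(q[s0].max())
```

In a deep AIL setting both functions would typically become losses over neural networks that are minimized approximately with stochastic gradient steps, which matches the abstract's remark that only two objectives need approximate optimization.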

Paper Structure

This paper contains 24 sections, 9 theorems, 86 equations, 5 figures, 2 tables, 2 algorithms.

Key Result

Lemma 1

Consider a sequence of rewards $\{ r^k \}_{k=1}^K$ and policies $\{ \pi^k \}_{k=1}^K$, and the policy $\bar{\pi}$ is uniformly sampled from $\{ \pi^k \}_{k=1}^K$. Then it holds that

Figures (5)

  • Figure 1: Overall performance on 8 DMControl tasks over 5 random seeds after 500k interactions with the environment. Here the $x$-axis is the number of expert trajectories and the $y$-axis is the return. The solid lines show the mean and the shaded regions show the standard deviation over 5 random seeds. The same conventions apply to the following figures.
  • Figure 2: Learning curves on 8 DMControl tasks over 5 random seeds using 1 expert trajectory. Here the $x$-axis is the number of environment interactions and the $y$-axis is the return.
  • Figure 3: Learning curves on 8 DMControl tasks over 5 random seeds using 4 expert trajectories.
  • Figure 4: Learning curves on 8 DMControl tasks over 5 random seeds using 7 expert trajectories.
  • Figure 5: Learning curves on 8 DMControl tasks over 5 random seeds using 10 expert trajectories.
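The solid line and shaded band described in the captions can be computed from per-seed returns as follows. This is a minimal sketch: the function name and the array layout (one row per seed, one column per evaluation point) are assumptions, and the page does not say whether the population or the sample standard deviation was used.

```python
import numpy as np


def summarize_seeds(returns):
    """Summarize returns of shape (num_seeds, num_points) across seeds.

    Gives the per-point mean (solid line) and the lower and upper edges
    of the one-standard-deviation band (shaded region).
    """
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean(axis=0)
    std = returns.std(axis=0)  # population std; pass ddof=1 for the sample std
    return mean, mean - std, mean + std
```

The three arrays can be passed directly to a plotting routine, with the mean drawn as a line and the two band edges used as the fill boundaries.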

Theorems & Definitions (13)

  • Definition 1: $\varepsilon$-covering number
  • Lemma 1
  • Definition 2: Reward Optimization Error
  • Theorem 1
  • Lemma 2: Upper Bound on Reward Error
  • Lemma 3: Upper Bound on Policy Error
  • Lemma 4
  • Lemma 5
  • Lemma 6: Freedman's inequality (Agarwal et al., 2014)
  • Lemma 7
  • ...and 3 more