Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning

Shangzhe Li, Dongruo Zhou, Weitong Zhang

TL;DR

This work addresses online adversarial imitation learning (AIL) with reward-free interaction by introducing MB-AIL, a model-based algorithm that learns a reward function adversarially and, separately, a transition model for optimistic planning. It establishes horizon-free, second-order regret bounds that depend on the instance-specific variance of returns and on the Eluder dimension, and derives information-theoretic lower bounds showing near-minimax optimality for online interaction with limited expert data. A hard-instance construction clarifies the roles of reward estimation and model estimation in the fundamental limits. The authors also provide a practical implementation with ensemble world models and neural networks, demonstrating improved sample efficiency on GridWorld and MuJoCo benchmarks compared to strong baselines. Overall, MB-AIL advances the theoretical understanding of online interaction in imitation learning and offers a scalable approach to reward-free, model-based AIL in complex environments.

Abstract

We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establishing horizon-free, second-order sample-complexity guarantees for it under general function approximation, for both expert data and reward-free interactions. These second-order bounds are instance-dependent: they scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations, and matches the lower bound for expert demonstrations in its dependence on the horizon $H$, the precision $\epsilon$, and the policy variance $\sigma^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.

Paper Structure

This paper contains 51 sections, 25 theorems, 112 equations, 3 figures, 5 tables, 2 algorithms.

Key Result

Theorem 5.1

For any $\delta \in (0,1)$, let $\beta = 7 \log \left( K\mathcal{N}_{\mathcal{P}}/\delta \right)$. Under the bounded-reward and realizability assumptions (assumption:bounded_reward, assumption:realizability), if FTRL (shalev2007convex) is employed as the no-regret algorithm in Line ln:opt of Algorithm alg:mbail, the averaged adversarial … where $\widetilde{\mathcal{O}}(\cdot)$ hides polylogarithmic factors in $K, H, N$. $d_E$ …
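
To give a feel for the scale of the parameter $\beta = 7 \log \left( K\mathcal{N}_{\mathcal{P}}/\delta \right)$ in Theorem 5.1, the sketch below evaluates it for a few illustrative inputs. It is only a numerical illustration under stated assumptions: the function name `confidence_radius` and the example values are hypothetical, the logarithm is assumed to be natural, $K$ is read as a count of online episodes, and $\mathcal{N}_{\mathcal{P}}$ is read as a size measure of the model class (presumably a covering or bracketing number in the sense of Definitions 3.2 and 3.3).

```python
import math


def confidence_radius(num_episodes: int, model_class_size: float, delta: float) -> float:
    """Evaluate beta = 7 * log(K * N_P / delta) from Theorem 5.1.

    Assumptions (not confirmed by the source): natural logarithm,
    K >= 1, N_P >= 1, and delta strictly inside (0, 1).
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in the open interval (0, 1)")
    if num_episodes < 1 or model_class_size < 1:
        raise ValueError("K and N_P must both be at least 1")
    return 7.0 * math.log(num_episodes * model_class_size / delta)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show how slowly beta grows with K.
    for num_episodes in (100, 10_000, 1_000_000):
        beta = confidence_radius(num_episodes, model_class_size=1e6, delta=0.05)
        print(f"K={num_episodes:>9,d}  beta={beta:.2f}")
```

Because $\beta$ depends on $K$, $\mathcal{N}_{\mathcal{P}}$ and $1/\delta$ only through a logarithm, multiplying any of them by 100 adds the same constant, $7 \log 100$, to $\beta$.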

Figures (3)

  • Figure 1: Structure of the Hard Instance. The reward can only be observed in green states. The absorbing fail state is shown in gray.
  • Figure 2: Results for GridWorld Experiments. We present results analyzing the impact of varying reward space sizes and different levels of environment stochasticity on adversarial imitation learning and behavioral cloning, reported over 5 random seeds.
  • Figure 3: GridWorld Illustration. An illustration of the $9 \times 9$ GridWorld with the agent initialized at $(0,0)$ is shown.

Theorems & Definitions (37)

  • Definition 3.1: $\ell_p$ Eluder dimension, 10.5555/2999792.2999864
  • Definition 3.2: $\epsilon$-covering number
  • Definition 3.3: Bracketing number, geer2000empirical
  • Theorem 5.1
  • Corollary 5.2
  • Remark 5.3: Expert Sample Complexity
  • Remark 5.4: Online Interaction Complexity
  • Remark 5.5: Deterministic Model and Policy
  • Remark 5.6: Behavioral Cloning (BC) vs. Adversarial Imitation Learning (AIL)
  • Remark 5.7: Stochasticity in BC vs. AIL
  • ...and 27 more