Provably Efficient Off-Policy Adversarial Imitation Learning with Convergence Guarantees

Yilei Chen, Vittorio Giammarino, James Queeney, Ioannis Ch. Paschalidis

TL;DR

This work provides the first convergence guarantees for off-policy adversarial imitation learning (AIL) without importance sampling, by marrying KL-divergence regularized mirror-descent policy updates with off-policy reward updates that reuse the $o(\sqrt{K})$ most recent policies. The analysis shows that the distribution shift caused by off-policy reward updates can be controlled, and that the resulting regret bound remains sublinear, with an explicit characterization of the optimal amount of past policy data to reuse. A key theoretical insight is the equivalence between mixed visitation distributions from multiple past policies and a single effective policy, enabling substantial sample efficiency gains, especially in large state spaces. Empirically, the approach yields faster convergence and reduced environment interactions in MiniGrid and MuJoCo tasks, with performance gains depending on the chosen number of past policies $N$.
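
The equivalence between a mixture of visitation distributions and a single effective policy can be checked numerically. The sketch below is illustrative only: it assumes a small random tabular MDP with a discounted, infinite-horizon occupancy measure (the paper's own setting may differ), and the names `occupancy`, `pi_eff`, and `d_mix` are hypothetical. It averages the occupancy measures of $N$ policies and verifies that the policy obtained by conditioning this average on the state induces the same occupancy measure.

```python
import numpy as np


def occupancy(P, mu, pi, gamma):
    """Discounted state-action occupancy d(s, a) of policy pi.

    P: (S, A, S) transition tensor, mu: (S,) initial distribution, pi: (S, A).
    """
    S = P.shape[0]
    # State-to-state transition matrix under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Solve (I - gamma * P_pi)^T d_s = (1 - gamma) * mu for the state occupancy.
    d_s = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, mu)
    return d_s[:, None] * pi


# Assumed toy problem sizes; none of these values come from the paper.
rng = np.random.default_rng(0)
S, A, gamma, N = 5, 3, 0.9, 4
P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
mu = rng.dirichlet(np.ones(S))

# Stand-ins for the N most recent policies.
policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(N)]
d_mix = np.mean([occupancy(P, mu, pi, gamma) for pi in policies], axis=0)

# Effective policy: condition the mixed occupancy on the state.
pi_eff = d_mix / d_mix.sum(axis=1, keepdims=True)
d_eff = occupancy(P, mu, pi_eff, gamma)
max_gap = np.abs(d_eff - d_mix).max()
```

Here `max_gap` should be zero up to floating-point error, because the averaged occupancy still satisfies the linear flow constraints of the MDP.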

Abstract

Adversarial Imitation Learning (AIL) faces challenges with sample inefficiency because of its reliance on sufficient on-policy data to evaluate the performance of the current policy during reward function updates. In this work, we study the convergence properties and sample complexity of off-policy AIL algorithms. We show that, even in the absence of importance sampling correction, reusing samples generated by the $o(\sqrt{K})$ most recent policies, where $K$ is the number of iterations of policy updates and reward updates, does not undermine the convergence guarantees of this class of algorithms. Furthermore, our results indicate that the distribution shift error induced by off-policy updates is dominated by the benefits of having more data available. This result provides theoretical support for the sample efficiency of off-policy AIL algorithms. To the best of our knowledge, this is the first work that provides theoretical guarantees for off-policy AIL algorithms.
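
A minimal tabular sketch of the two updates described above, under stated assumptions: the buffer stores visit counts from the $N$ most recent policies and pools them without importance weights, the reward step is a projected gradient step toward the expert's visitation, and the policy step is the closed-form KL-regularized mirror-descent update under a reward-maximization sign convention. The function names, the $[0, 1]$ projection, the toy counts, and the use of the reward as a stand-in for the Q-function are all illustrative choices, not the paper's implementation.

```python
from collections import deque

import numpy as np


def mirror_descent_step(pi, q, eta):
    """KL-regularized mirror descent: new pi(a|s) proportional to pi(a|s) * exp(eta * q(s, a))."""
    logits = np.log(pi) + eta * q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)


def reward_step(r, expert_counts, buffer, eta_r):
    """Ascent step on the reward using pooled, unweighted samples from the buffer."""
    pooled = np.sum(list(buffer), axis=0)
    pooled = pooled / pooled.sum()
    expert = expert_counts / expert_counts.sum()
    # Hypothetical projection onto [0, 1]; the paper's reward class may differ.
    return np.clip(r + eta_r * (expert - pooled), 0.0, 1.0)


S, A, N = 2, 2, 3
buffer = deque(maxlen=N)  # keeps data from the N most recent policies only
for k in range(5):
    counts = np.full((S, A), 1.0)
    counts[0, 0] += k  # toy visit counts collected by policy k
    buffer.append(counts)

expert_counts = np.array([[0.0, 4.0], [0.0, 4.0]])
r = reward_step(np.full((S, A), 0.5), expert_counts, buffer, eta_r=0.1)
# Simplification: the updated reward is used directly in place of a Q-function.
pi = mirror_descent_step(np.full((S, A), 0.5), r, eta=1.0)
```

Setting `N = 1` in this sketch recovers the on-policy case, where only the current policy's samples enter the reward update.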

Paper Structure

This paper contains 27 sections, 16 theorems, 69 equations, 2 figures, and 1 algorithm.

Key Result

Lemma 3.2

The regret of an AIL algorithm over $K$ updates can be bounded by

Figures (2)

  • Figure 1: Experimental results for our off-policy AIL algorithm with different $N$ in three MiniGrid EmptyRoom tasks with room sizes of $3\times 3$, $5\times 5$, and $9\times 9$, from left to right. Training curves show total reward per episode as a function of environment interactions. We evaluate the learned policy by its average performance over $5$ episodes. $N$ denotes the number of most recent policies considered during reward updates, where $N = 1$ corresponds to the on-policy algorithm. The expert demonstrations consist of $4$ hand-crafted trajectories. We run each experiment with $5$ different seeds, and the shading represents the standard deviation. For more implementation details, please refer to the MiniGrid implementation details section in the Appendix.
  • Figure 2: Experimental results for our off-policy AIL algorithm with different $N$ in three continuous-space MuJoCo locomotion environments: HalfCheetah-v2, Hopper-v2, and Walker2d-v2. Training curves show total reward per episode as a function of environment interactions. We evaluate the learned policy by its average performance over $10$ episodes. $N$ denotes the number of most recent policies considered during reward updates, where $N = 1$ corresponds to the on-policy algorithm. The expert demonstrations consist of $10$ trajectories generated by an expert trained with Soft Actor-Critic (Haarnoja et al., 2018). We run each experiment with $10$ different seeds, and the shading represents the standard error. For more implementation details, please refer to the MuJoCo implementation details section in the Appendix.
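
The two figures shade different quantities: standard deviation across seeds in Figure 1 and standard error in Figure 2. The sketch below shows how the two bands differ when computed from per-seed evaluation returns. The function name `shading_band`, the array layout, and the use of the sample standard deviation (`ddof=1`) are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def shading_band(returns, kind):
    """Mean curve and shading half-width across seeds.

    returns: (num_seeds, num_checkpoints) array of evaluation returns.
    kind: "std" for standard deviation, "stderr" for standard error.
    """
    mean = returns.mean(axis=0)
    std = returns.std(axis=0, ddof=1)  # assumed: sample standard deviation
    if kind == "std":
        band = std
    elif kind == "stderr":
        band = std / np.sqrt(returns.shape[0])
    else:
        raise ValueError(f"unknown kind: {kind}")
    return mean, band


# Toy data: 2 seeds, 2 evaluation checkpoints.
toy = np.array([[0.0, 2.0], [2.0, 4.0]])
mean, std_band = shading_band(toy, "std")
_, se_band = shading_band(toy, "stderr")
```

For the same data, the standard-error band is narrower than the standard-deviation band by a factor of the square root of the number of seeds, so the shaded regions of the two figures are not directly comparable.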

Theorems & Definitions (29)

  • Definition 3.1: AIL Regret
  • Lemma 3.2: Lemma 2 in Shani et al. (2022)
  • Lemma 4.1: Lemma 4 in Shani et al. (2022)
  • Lemma 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Theorem 4.5
  • proof
  • Proposition 4.6
  • Proposition 4.7
  • ...and 19 more