Provably Efficient Off-Policy Adversarial Imitation Learning with Convergence Guarantees

Yilei Chen; Vittorio Giammarino; James Queeney; Ioannis Ch. Paschalidis

TL;DR

This work provides the first convergence guarantees for off-policy adversarial imitation learning (AIL) without importance sampling, by marrying KL-divergence regularized mirror-descent policy updates with off-policy reward updates that reuse the $o(\sqrt{K})$ most recent policies. The analysis shows that the distribution shift caused by off-policy reward updates can be controlled, and that the resulting regret bound remains sublinear, with an explicit characterization of the optimal amount of past policy data to reuse. A key theoretical insight is the equivalence between mixed visitation distributions from multiple past policies and a single effective policy, enabling substantial sample efficiency gains, especially in large state spaces. Empirically, the approach yields faster convergence and reduced environment interactions in MiniGrid and MuJoCo tasks, with performance gains depending on the chosen number of past policies $N$.
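The equivalence between a mixture of visitation distributions and a single effective policy can be checked numerically on a toy problem. The sketch below is not taken from the paper: it assumes a small tabular MDP with discounted, infinite-horizon occupancy measures (the paper's setting may differ), and the transition table, policies, and all function names are made up for illustration. It averages the occupancy measures of two policies, builds the policy obtained by conditioning that mixture on the state, and compares the two occupancy measures.

```python
def occupancy(P, mu0, pi, gamma, steps=2000):
    """Discounted state-action occupancy of policy pi (truncated series)."""
    n_s, n_a = len(mu0), len(pi[0])
    d = [[0.0] * n_a for _ in range(n_s)]
    state_dist, weight = list(mu0), 1.0 - gamma
    for _ in range(steps):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                mass = state_dist[s] * pi[s][a]
                d[s][a] += weight * mass
                for s2 in range(n_s):
                    nxt[s2] += mass * P[s][a][s2]
        state_dist, weight = nxt, weight * gamma
    return d


def mixture(occupancies):
    """Uniform average of several occupancy measures."""
    n = len(occupancies)
    return [[sum(d[s][a] for d in occupancies) / n
             for a in range(len(occupancies[0][0]))]
            for s in range(len(occupancies[0]))]


def effective_policy(d):
    """Policy pi(a|s) = d(s,a) / sum_a d(s,a); uniform where a state is unvisited."""
    policy = []
    for row in d:
        total = sum(row)
        policy.append([x / total for x in row] if total > 0
                      else [1.0 / len(row)] * len(row))
    return policy


# Hypothetical 3-state, 2-action MDP: P[s][a] is the next-state distribution.
P = [
    [[0.9, 0.1, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.8, 0.2], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]],
]
mu0 = [0.5, 0.3, 0.2]
gamma = 0.9
pi_old = [[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]]
pi_new = [[0.0, 1.0], [0.7, 0.3], [0.1, 0.9]]

mixed = mixture([occupancy(P, mu0, pi_old, gamma),
                 occupancy(P, mu0, pi_new, gamma)])
single = occupancy(P, mu0, effective_policy(mixed), gamma)

# Largest entrywise difference between the mixture and the single-policy occupancy.
max_gap = max(abs(mixed[s][a] - single[s][a])
              for s in range(3) for a in range(2))
print(max_gap)
```

Under these assumptions the gap should be at the level of floating-point error, because the averaged occupancy still satisfies the linear flow constraints of the MDP. Note that the effective policy is generally not the average of the individual policies; it weights each policy by how often that policy visits the state.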

Abstract

Adversarial Imitation Learning (AIL) faces challenges with sample inefficiency because of its reliance on sufficient on-policy data to evaluate the performance of the current policy during reward function updates. In this work, we study the convergence properties and sample complexity of off-policy AIL algorithms. We show that, even in the absence of importance sampling correction, reusing samples generated by the $o(\sqrt{K})$ most recent policies, where $K$ is the number of iterations of policy updates and reward updates, does not undermine the convergence guarantees of this class of algorithms. Furthermore, our results indicate that the distribution shift error induced by off-policy updates is dominated by the benefits of having more data available. This result provides theoretical support for the sample efficiency of off-policy AIL algorithms. To the best of our knowledge, this is the first work that provides theoretical guarantees for off-policy AIL algorithms.
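The data-reuse scheme described in the abstract can be sketched as a buffer that retains only the samples of the $N$ most recent policies and pools them, without importance weights, for the reward update. This is an illustrative sketch rather than the authors' implementation: `RecentPolicyBuffer`, `num_recent_policies`, and the schedule $N = \lfloor K^{0.25} \rfloor$ are hypothetical choices, the latter being just one example of a quantity that grows as $o(\sqrt{K})$.

```python
from collections import deque


class RecentPolicyBuffer:
    """Keeps the samples gathered by the N most recent policies (hypothetical helper)."""

    def __init__(self, num_policies):
        if num_policies < 1:
            raise ValueError("num_policies must be at least 1")
        # Each entry holds all (state, action) pairs collected by one policy.
        self._batches = deque(maxlen=num_policies)

    def add_policy_batch(self, samples):
        # Once the deque is full, appending drops the oldest policy's batch.
        self._batches.append(list(samples))

    def reward_update_samples(self):
        # Pool samples across the retained policies with no importance weights.
        return [pair for batch in self._batches for pair in batch]


def num_recent_policies(total_iterations, exponent=0.25):
    """Illustrative schedule N = floor(K^exponent); any exponent below 0.5 is o(sqrt(K))."""
    if not 0.0 <= exponent < 0.5:
        raise ValueError("exponent must lie in [0, 0.5)")
    return max(1, int(total_iterations ** exponent))


# Usage: with N = 2, the batch from the oldest of three policies is discarded.
buffer = RecentPolicyBuffer(num_policies=2)
buffer.add_policy_batch([("s0", "a0")])
buffer.add_policy_batch([("s1", "a1")])
buffer.add_policy_batch([("s2", "a0")])
pooled = buffer.reward_update_samples()
```

Setting `num_policies=1` recovers the on-policy case, in which the reward update sees only the current policy's samples.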

Paper Structure (27 sections, 16 theorems, 69 equations, 2 figures, 1 algorithm)

Introduction
Related Work
Preliminaries
Reinforcement learning
Adversarial imitation learning
Remark.
Off-Policy Adversarial Imitation Learning
Convergent Off-Policy AIL
Policy updates
Reward updates
Main result
Sample Efficient Off-Policy AIL
Experiments
MiniGrid Environments
MuJoCo Benchmarks
...and 12 more sections

Key Result

Lemma 3.2

The regret of an AIL algorithm over $K$ updates can be bounded by

Figures (2)

Figure 1: Experimental results for our off-policy AIL algorithm with different $N$ in three MiniGrid EmptyRoom tasks with room sizes $3\times 3$, $5\times 5$, and $9\times 9$, from left to right. Training curves show total reward per episode as a function of environment interactions. We evaluate the learned policy by its average performance over $5$ episodes. $N$ denotes the number of most recent policies considered during reward updates, where $N = 1$ corresponds to the on-policy algorithm. The expert demonstrations consist of $4$ hand-crafted trajectories. We run each experiment with $5$ different seeds, and the shading represents the standard deviation. For more implementation details, please refer to Section \ref{sec:minigrid_details} in the Appendix.
Figure 2: Experimental results for our off-policy AIL algorithm with different $N$ in three continuous-space MuJoCo locomotion environments: HalfCheetah-v2, Hopper-v2, and Walker2d-v2. Training curves show total reward per episode as a function of environment interactions. We evaluate the learned policy by its average performance over $10$ episodes. $N$ denotes the number of most recent policies considered during reward updates, where $N = 1$ corresponds to the on-policy algorithm. The expert demonstrations consist of $10$ trajectories generated by a policy trained with Soft Actor-Critic (Haarnoja et al., 2018). We run each experiment with $10$ different seeds, and the shading represents the standard error. For more implementation details, please refer to Section \ref{sec:mujoco_details} in the Appendix.

Theorems & Definitions (29)

Definition 3.1: AIL Regret
Lemma 3.2: Lemma 2 in Shani et al. (2022)
Lemma 4.1: Lemma 4 in Shani et al. (2022)
Lemma 4.2
Theorem 4.3
Theorem 4.4
Theorem 4.5
proof
Proposition 4.6
Proposition 4.7
...and 19 more
