Provably Efficient Off-Policy Adversarial Imitation Learning with Convergence Guarantees
Yilei Chen, Vittorio Giammarino, James Queeney, Ioannis Ch. Paschalidis
TL;DR
This work provides the first convergence guarantees for off-policy adversarial imitation learning (AIL) without importance sampling, by combining KL-divergence regularized mirror-descent policy updates with off-policy reward updates that reuse data from the $o(\sqrt{K})$ most recent policies. The analysis shows that the distribution shift caused by off-policy reward updates can be controlled, and that the resulting regret bound remains sublinear, with an explicit characterization of the optimal amount of past policy data to reuse. A key theoretical insight is that the mixed visitation distribution of multiple past policies is equivalent to the visitation distribution of a single effective policy, which enables substantial sample efficiency gains, especially in large state spaces. Empirically, the approach converges faster and needs fewer environment interactions in MiniGrid and MuJoCo tasks, with the size of the gain depending on the chosen number of past policies $N$.
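The equivalence between a mixture of visitation distributions and a single effective policy can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: it assumes a randomly generated tabular MDP with a discounted, infinite-horizon occupancy measure (the paper's setting may differ), and the helper names `occupancy` and `effective_policy` are hypothetical.

```python
import numpy as np

def occupancy(P, mu, pi, gamma):
    """Discounted state-action occupancy d(s, a) of policy pi.

    P: (S, A, S) transition tensor, mu: (S,) initial distribution,
    pi: (S, A) policy with rows summing to one.
    """
    S = P.shape[0]
    # State-to-state kernel induced by pi: P_pi[s, t] = sum_a pi[s, a] * P[s, a, t]
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Solve the flow equation d = (1 - gamma) * mu + gamma * P_pi^T d
    d_state = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return d_state[:, None] * pi

def effective_policy(d_mix):
    """Policy obtained by conditioning the mixed occupancy on the state."""
    return d_mix / d_mix.sum(axis=1, keepdims=True)

# Assumed toy problem: random MDP and N random "past" policies
rng = np.random.default_rng(0)
S, A, N, gamma = 5, 3, 4, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
mu = np.full(S, 1.0 / S)                     # full-support initial distribution
policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(N)]

# Uniform mixture of the N occupancy measures
d_mix = np.mean([occupancy(P, mu, pi, gamma) for pi in policies], axis=0)

# The single effective policy reproduces the mixture up to numerical error
pi_eff = effective_policy(d_mix)
gap = np.abs(occupancy(P, mu, pi_eff, gamma) - d_mix).max()
print(f"max deviation: {gap:.2e}")
```

The check works because the flow constraints that define an occupancy measure are linear, so an average of valid occupancy measures is itself valid and is induced by the policy obtained from state-wise normalization. The full-support initial distribution keeps every state's occupancy strictly positive, so the normalization is well defined.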
Abstract
Adversarial Imitation Learning (AIL) is sample-inefficient because each reward function update relies on sufficient on-policy data to evaluate the performance of the current policy. In this work, we study the convergence properties and sample complexity of off-policy AIL algorithms. We show that, even in the absence of importance sampling correction, reusing samples generated by the $o(\sqrt{K})$ most recent policies, where $K$ is the number of iterations of policy and reward updates, does not undermine the convergence guarantees of this class of algorithms. Furthermore, our results indicate that the distribution shift error induced by off-policy updates is dominated by the benefits of having more data available. This result provides theoretical support for the sample efficiency of off-policy AIL algorithms. To the best of our knowledge, this is the first work to provide theoretical guarantees for off-policy AIL algorithms.
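The idea of reusing samples from the most recent policies without importance sampling can be pictured with a sliding-window buffer. The following is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes a reward that is linear in state-action features, a plain gradient-ascent reward step, and equal batch sizes per policy; the names `RecentPolicyBuffer` and `reward_step` are hypothetical.

```python
from collections import deque
import numpy as np

class RecentPolicyBuffer:
    """Keeps feature batches collected by the N most recent policies."""

    def __init__(self, num_policies):
        # deque with maxlen evicts the oldest policy's batch automatically
        self.batches = deque(maxlen=num_policies)

    def add(self, features):
        self.batches.append(np.asarray(features, dtype=float))

    def pooled_mean(self):
        # Uniform pooling over stored samples, with no importance weights
        return np.concatenate(list(self.batches), axis=0).mean(axis=0)

def reward_step(theta, expert_features, buffer, lr):
    """One ascent step on <theta, E_expert[phi] - E_buffer[phi]> (assumed linear reward)."""
    grad = expert_features.mean(axis=0) - buffer.pooled_mean()
    return theta + lr * grad

# Toy usage with synthetic features standing in for rollouts of policies k = 0..4
rng = np.random.default_rng(0)
buffer = RecentPolicyBuffer(num_policies=3)
for k in range(5):
    buffer.add(rng.normal(loc=k, size=(8, 2)))
theta = reward_step(np.zeros(2), rng.normal(size=(16, 2)), buffer, lr=0.1)
```

With equal batch sizes, the pooled mean is the feature expectation under a uniform mixture of the stored policies. How large the window should be relative to $K$ is the question the paper's analysis addresses; the value 3 above is arbitrary.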
