Agnostic Interactive Imitation Learning: New Theory and Practical Algorithms
Yichen Li, Chicheng Zhang
TL;DR
This work tackles agnostic interactive imitation learning, where the demonstrator may lie outside the learner's policy class, and proposes an oracle-efficient framework that applies to arbitrary policy classes. It introduces MFTPL-P (Mixed Follow the Perturbed Leader with Poisson perturbations), which uses a covering state distribution $d_0$ and an offline classification oracle to achieve no-regret online imitation learning with provable finite-sample guarantees. A practical variant, Bootstrap-Dagger (BD), removes the need for access to $d_0$. The theoretical guarantees establish sublinear regret with high probability, which translates into performance competitive with the expert under the reduction framework. Empirically, MFTPL-P and BD outperform online and offline baselines on four continuous-control tasks and tolerate nonrealizable experts. The paper also discusses limitations of the online-reduction approach, provides detailed proofs, and shows that ensemble-based data collection (via BD) yields substantial practical gains, offering a scalable path toward robust imitation learning in complex environments.
Abstract
We study interactive imitation learning, where a learner interactively queries a demonstrating expert for action annotations, aiming to learn a policy whose performance is competitive with the expert's while using as few annotations as possible. We focus on the general agnostic setting, where the expert demonstration policy may not be contained in the policy class used by the learner. We propose a new oracle-efficient algorithm, MFTPL-P (short for Mixed Follow the Perturbed Leader with Poisson perturbations), with provable finite-sample guarantees, under the assumption that the learner is given access to samples from some "explorative" distribution over states. Our guarantees hold for any policy class, which is considerably more general than the prior state of the art. We further propose Bootstrap-Dagger, a more practical variant that does not require additional sample access. Empirically, MFTPL-P and Bootstrap-Dagger notably surpass online and offline imitation learning baselines in continuous control tasks.
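To make the idea of a Poisson-perturbed call to an offline classification oracle concrete, the following is a minimal sketch and not the paper's algorithm. Everything in it is an assumption made for illustration: the one-dimensional state space, the explorative distribution taken to be Uniform[0, 1], the finite class of threshold policies, the uniformly random labels on perturbation samples, and the helper names `erm_oracle`, `poisson_perturbation`, and `train_ensemble`. The toy expert uses a threshold outside the policy class, so the setting is agnostic.

```python
import numpy as np

def erm_oracle(data, thresholds):
    """Hypothetical offline classification oracle: return the threshold t whose
    policy (action 1 if state >= t, else 0) makes the fewest mistakes on data."""
    states = np.array([s for s, _ in data], dtype=float)
    actions = np.array([a for _, a in data], dtype=int)
    errors = [int(np.sum((states >= t).astype(int) != actions)) for t in thresholds]
    return float(thresholds[int(np.argmin(errors))])  # ties go to the smallest t

def poisson_perturbation(rng, lam, num_actions=2):
    """Draw Poisson(lam) many states from an assumed d_0 = Uniform[0, 1] and
    attach uniformly random action labels (an illustrative assumption)."""
    n = rng.poisson(lam)
    states = rng.uniform(0.0, 1.0, size=n)
    labels = rng.integers(0, num_actions, size=n)
    return list(zip(states.tolist(), labels.tolist()))

def train_ensemble(data, thresholds, rng, ensemble_size=5, lam=3.0):
    """Each ensemble member is the oracle's output on the expert-labeled data
    plus its own independent perturbation set."""
    return [
        erm_oracle(data + poisson_perturbation(rng, lam), thresholds)
        for _ in range(ensemble_size)
    ]

rng = np.random.default_rng(0)
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical policy class
expert = lambda s: int(s >= 0.35)          # expert lies outside the class
data = [(s, expert(s)) for s in rng.uniform(0.0, 1.0, size=20).tolist()]

ensemble = train_ensemble(data, thresholds, rng)
# A mixed policy plays a uniformly random ensemble member.
mixed_choice = ensemble[rng.integers(len(ensemble))]
```

In an interactive loop, the states visited by the mixed policy would be sent to the expert for labels and appended to `data` before the ensemble is retrained. That loop is omitted here because it depends on environment dynamics the abstract does not specify.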
