Agnostic Interactive Imitation Learning: New Theory and Practical Algorithms

Yichen Li; Chicheng Zhang

TL;DR

This work tackles agnostic interactive imitation learning, where the demonstrator may lie outside the learner's policy class, and proposes an oracle-efficient framework that generalizes to arbitrary policy classes. It introduces MFTPL-P, Mixed Follow the Perturbed Leader with Poisson Perturbations, which leverages a covering state distribution $d_0$ and an offline classification oracle to achieve no-regret online imitation learning with provable finite-sample guarantees; a practical variant Bootstrap-Dagger (BD) removes the need for access to $d_0$. Theoretical guarantees show sublinear regret with high probability, translating into competitive performance against the expert under the reduction framework; empirically, MFTPL-P and BD outperform online/offline baselines on four continuous-control tasks and tolerate nonrealizable experts. The paper also discusses limitations of the online-reduction approach, provides detailed proofs, and demonstrates that ensemble-based data collection (via BD) yields substantial practical gains, offering a scalable path for robust imitation learning in complex environments.
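To make the description above more concrete, here is a minimal Python sketch of one round of a sample-based perturbed-leader ensemble. It is an illustration under stated assumptions, not the paper's implementation: the finite class of 1-D threshold policies, the brute-force `erm_oracle`, the uniformly random action labels on perturbation states, and all names (`draw_perturbation`, `perturbed_leader_round`, `mixed_action`, `poisson_mean`, `ensemble_size`) are hypothetical choices made for this example.

```python
import numpy as np

def predict(threshold, state):
    """Threshold policy on a scalar state: action 1 if state >= threshold, else 0."""
    return int(state >= threshold)

def erm_oracle(thresholds, examples):
    """Toy offline classification oracle: brute-force ERM over a finite class."""
    def mistakes(theta):
        return sum(predict(theta, s) != a for s, a in examples)
    return min(thresholds, key=mistakes)

def draw_perturbation(rng, sample_d0, num_actions, poisson_mean):
    """Poisson-sized set of states from d_0; random labels are an assumption here."""
    size = int(rng.poisson(poisson_mean))
    return [(sample_d0(rng), int(rng.integers(num_actions))) for _ in range(size)]

def perturbed_leader_round(rng, thresholds, data, sample_d0, poisson_mean, ensemble_size):
    """Each member is the ERM on the expert data plus its own fresh perturbation set."""
    ensemble = []
    for _ in range(ensemble_size):
        perturbation = draw_perturbation(rng, sample_d0, num_actions=2,
                                         poisson_mean=poisson_mean)
        ensemble.append(erm_oracle(thresholds, data + perturbation))
    return ensemble

def mixed_action(rng, ensemble, state):
    """Mixed policy: act with an ensemble member drawn uniformly at random."""
    member = ensemble[int(rng.integers(len(ensemble)))]
    return predict(member, state)

rng = np.random.default_rng(0)
sample_d0 = lambda r: float(r.uniform(-2.0, 2.0))  # hypothetical stand-in for d_0
states = np.linspace(-2.0, 2.0, 201)
data = [(float(s), predict(0.0, float(s))) for s in states]  # expert: threshold at 0
ensemble = perturbed_leader_round(rng, [-1.0, 0.0, 1.0], data, sample_d0,
                                  poisson_mean=1.0, ensemble_size=5)
```

In this toy setup the expert data dominates the small perturbation sets, so every ensemble member agrees with the expert; larger `poisson_mean` values would inject more randomness into the ensemble.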

Abstract

We study interactive imitation learning, where a learner interactively queries a demonstrating expert for action annotations, aiming to learn a policy that has performance competitive with the expert, using as few annotations as possible. We focus on the general agnostic setting where the expert demonstration policy may not be contained in the policy class used by the learner. We propose a new oracle-efficient algorithm MFTPL-P (abbreviation for Mixed Follow the Perturbed Leader with Poisson perturbations) with provable finite-sample guarantees, under the assumption that the learner is given access to samples from some "explorative" distribution over states. Our guarantees hold for any policy class, which is considerably broader than prior state of the art. We further propose Bootstrap-Dagger, a more practical variant that does not require additional sample access. Empirically, MFTPL-P and Bootstrap-Dagger notably surpass online and offline imitation learning baselines in continuous control tasks.
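The abstract does not spell out how Bootstrap-Dagger builds its policies. As a rough illustration of the bagging idea its name suggests, the Python sketch below trains an ensemble on bootstrap resamples of an aggregated dataset and acts with a randomly chosen member. The least-squares linear policy, the function names (`fit_linear_policy`, `bootstrap_ensemble`, `ensemble_action`), and the choice to draw one member per action are assumptions for this example rather than details taken from the paper.

```python
import numpy as np

def fit_linear_policy(states, actions):
    """Least-squares linear map from states to actions (stand-in for any learner)."""
    weights, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return weights

def bootstrap_ensemble(rng, states, actions, ensemble_size):
    """Train each member on a resample drawn with replacement from the dataset."""
    n = len(states)
    members = []
    for _ in range(ensemble_size):
        idx = rng.integers(n, size=n)  # bootstrap indices, with replacement
        members.append(fit_linear_policy(states[idx], actions[idx]))
    return members

def ensemble_action(rng, members, state):
    """Act with one ensemble member chosen uniformly at random (assumed rule)."""
    weights = members[int(rng.integers(len(members)))]
    return state @ weights

rng = np.random.default_rng(0)
true_weights = np.array([[1.0], [-2.0], [0.5]])   # hypothetical linear expert
states = rng.normal(size=(200, 3))
actions = states @ true_weights                    # noiseless expert annotations
members = bootstrap_ensemble(rng, states, actions, ensemble_size=5)
```

With noiseless labels every member recovers the same weights; with noisy or nonrealizable experts the members would differ, which is where an ensemble can diversify the states visited during data collection.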

Paper Structure (26 sections, 20 theorems, 106 equations, 18 figures, 3 tables, 4 algorithms)

Introduction
Related Work
Preliminaries
Oracle-efficient Imitation Learning: Algorithm and Analysis for General Policy Classes
Experiments
Experiment Settings
Utility of Sample-based Perturbation
Performance Evaluation of $\textsc{MFTPL-P}$ and Its Practical Variant $\textsc{Bootstrap-DAgger}$
Explaining the benefit of $\textsc{Bootstrap-DAgger}$
Conclusion
The Online Imitation Learning Reduction Framework
Limitations of the reduction-based framework
Proofs for Section \ref{sec:mftplp}
Notations and algorithm
Auxiliary Lemmas
...and 11 more sections

Key Result

Theorem 2

Suppose $(\mathcal{M},\pi^{\mathrm{exp}})$ is $\mu$-recoverable with respect to $\ell$. Define the regret of the sequence of policies $\left\{\pi_n\right\}_{n=1}^N$ w.r.t. policy class $\mathcal{B}$ as: Then $\hat{\pi}$, obtained by choosing a policy uniformly at random from $\left\{\pi_n\right\}_{n=1}^N$ and adhering to it, satisfies:
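The theorem statement refers to the regret of a policy sequence and to a uniformly random choice among the iterates. As a hedged illustration only, the Python sketch below uses the standard online-learning notion of regret (cumulative loss of the played policies minus that of the best fixed policy in hindsight) for a finite class, and forms $\hat{\pi}$ by drawing one of the played policies uniformly at random. The loss matrix, the finite class, and the names `regret` and `sample_output_policy` are assumptions for this example, and the paper's exact definition may differ.

```python
import numpy as np

def regret(losses, played):
    """Standard regret against the best fixed policy in hindsight.

    losses[n, k] is the loss of policy k in round n (hypothetical values);
    played[n] is the index of the policy used in round n.
    """
    losses = np.asarray(losses, dtype=float)
    played = np.asarray(played)
    learner_loss = losses[np.arange(len(played)), played].sum()
    best_fixed_loss = losses.sum(axis=0).min()
    return learner_loss - best_fixed_loss

def sample_output_policy(rng, played):
    """Return the index of one played policy, chosen uniformly at random."""
    return int(played[int(rng.integers(len(played)))])

losses = [[0.0, 1.0],
          [1.0, 0.0],
          [0.0, 1.0]]
played = [1, 0, 1]               # the learner picks the worse policy every round
print(regret(losses, played))    # learner loss 3.0, best fixed loss 1.0
```

A sublinear regret bound means this quantity grows more slowly than the number of rounds, so the average per-round gap to the best policy in the class vanishes.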

Figures (18)

Figure 1: Comparative performance of $\textsc{MFTPL-P}$ using linear models with nonrealizable MLP experts: variation across different perturbation state sources and set sizes in Ant and Hopper. The shaded region represents the range between the $10^{\text{th}}$ and $90^{\text{th}}$ quantiles of the bootstrap confidence interval \cite{diciccio1996bootstrap}, computed over 10 runs. On the left, the perturbation example sources are states collected by $\textsc{DAgger}$ on each task, while the right side uses the uniform distribution over $[-2,2]^{28}$ (Ant) and $[-2,2]^{11}$ (Hopper). Overall, the $\textsc{MP-25}$ variants on the left outperform their counterparts on the right. Meanwhile, $\textsc{MP-25(15)}$ leads in performance, except in Ant with uniform $d_0$ (upper right).
Figure 2: Results on continuous control tasks with realizable and non-realizable experts. Remarkably, $\textsc{MP-25(15)}$ (magenta), $\textsc{BD-25}$ (blue-green), and $\textsc{BD-5}$ (green) surpass the baselines in both settings. The gap between these three methods and the baselines is particularly evident in the non-realizable setting.
Figure 3: Results comparing $\textsc{BD-5}$ and $\textsc{DAgger}$, along with the two additional approaches in Section \ref{sec:benefit_of_ensemble}, on Ant and Hopper. Bagging on data collected by $\textsc{DAgger}$ yields pink learning curves that align closely with $\textsc{DAgger}$'s performance (red). Meanwhile, naive supervised learning on data collected by $\textsc{BD-5}$ produces lime green learning curves that match the performance of $\textsc{BD-5}$ (green). Overall, the two methods that use ensembles for data collection (green and lime green) outperform the two that do not (red and pink). This suggests that $\textsc{BD-5}$ improves over $\textsc{DAgger}$ by collecting better data.
Figure 4: Example MDP showing a limitation of the reduction-based framework.
Figure 5: Dependency graph of the notation that appears in the analysis. Solid and dashed arrows indicate deterministic and stochastic dependence, respectively. Note that all $(Q_{n+1,e})_{e \in [E]}$'s are drawn independently from fixed sample perturbation distributions and can be treated as fresh i.i.d. random examples.
...and 13 more figures
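The shaded bands in Figure 1 are described as the 10th to 90th quantile range of a bootstrap confidence interval over 10 runs. A minimal Python sketch of one common way to compute such a band, the percentile bootstrap of the mean across runs, is given below. The captions do not state which bootstrap variant or statistic was used, so the function name `bootstrap_band`, the choice of the mean, the number of resamples, and the synthetic returns are all assumptions.

```python
import numpy as np

def bootstrap_band(rng, returns, num_resamples=2000, lower=10, upper=90):
    """Percentile-bootstrap band for the mean over runs at each evaluation point.

    returns has shape (num_runs, num_points); the result is a pair of arrays
    of shape (num_points,) holding the lower and upper percentiles.
    """
    returns = np.asarray(returns, dtype=float)
    num_runs = returns.shape[0]
    # Resample run indices with replacement, once per bootstrap replicate.
    idx = rng.integers(num_runs, size=(num_resamples, num_runs))
    resampled_means = returns[idx].mean(axis=1)   # (num_resamples, num_points)
    lo, hi = np.percentile(resampled_means, [lower, upper], axis=0)
    return lo, hi

rng = np.random.default_rng(0)
returns = rng.normal(loc=100.0, scale=5.0, size=(10, 4))  # synthetic: 10 runs, 4 points
lo, hi = bootstrap_band(rng, returns)
```

Plotting the mean curve with `lo` and `hi` as the fill boundaries gives a shaded region of the kind described in the caption.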

Theorems & Definitions (26)

Definition 1
Theorem 2: \cite{ross2011reduction}
Theorem 3
Corollary 4
Theorem 5: Restatement of Theorem \ref{thm:reduction}, originally from \cite{ross2011reduction}, Theorem 3.2
Lemma 6: Performance Difference Lemma, Lemma 4.3 of \cite{ross2014reinforcement}
Proposition 7
Remark 8
Remark 9
Remark 10
...and 16 more
