Interactive and Hybrid Imitation Learning: Provably Beating Behavior Cloning

Yichen Li, Chicheng Zhang

TL;DR

This work investigates imitation learning under two data regimes: offline demonstrations and interactive, state-wise feedback. It introduces Stagger, a state-wise DAgger variant, and shows provable sample-efficiency gains over Behavior Cloning under recoverable MDPs; it further develops Warm-Stagger, a hybrid method that blends offline trajectories with interactive queries and offers guarantees not much worse than either data source alone. The paper also demonstrates a tabular MDP where hybrid IL strictly outperforms both baselines and validates the approaches on MuJoCo continuous-control tasks under modest annotation costs. Overall, it highlights the practical and theoretical value of state-wise interactive annotation and hybrid feedback for efficient imitation learning.

Abstract

Imitation learning (IL) is a paradigm for learning sequential decision-making policies from experts, leveraging offline demonstrations, interactive annotations, or both. Recent advances show that when annotation cost is tallied per trajectory, Behavior Cloning (BC), which relies solely on offline demonstrations, cannot be improved in general, leaving limited conditions for interactive methods such as DAgger to help. We revisit this conclusion and prove that when the annotation cost is measured per state, algorithms using interactive annotations can provably outperform BC. Specifically: (1) we show that Stagger, a one-sample-per-round variant of DAgger, provably beats BC under low-recovery-cost settings; (2) we initiate the study of hybrid IL, where the agent learns from both offline demonstrations and interactive annotations. We propose Warm-Stagger, whose learning guarantee is not much worse than using either data source alone. Furthermore, motivated by the compounding error and cold-start problems in imitation learning practice, we give an MDP example in which Warm-Stagger has significantly better annotation cost; (3) experiments on MuJoCo continuous-control tasks confirm that, with a modest cost ratio between interactive and offline annotations, interactive and hybrid approaches consistently outperform BC. To the best of our knowledge, our work is the first to highlight the benefit of state-wise interactive annotation and hybrid feedback in imitation learning.
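To make the "one sample per round" idea concrete, the following is a minimal sketch on a toy chain MDP, not the paper's implementation. It assumes that each round rolls out the current policy, picks one visited state uniformly at random, and asks the expert to label only that state. The chain environment, the `expert` rule, the lookup-table learner, and all function names are hypothetical choices made for illustration.

```python
import random

S, H = 6, 10  # number of states and horizon of the toy chain (illustrative values)

def expert(s):
    """Hypothetical deterministic expert: action 1 on odd states, 0 on even ones."""
    return s % 2

def step(s, a):
    """Advance along the chain on the expert's action; otherwise stay put."""
    return min(s + 1, S - 1) if a == expert(s) else s

def rollout(policy):
    """Return the H states visited by `policy` when starting from state 0."""
    states, s = [], 0
    for _ in range(H):
        states.append(s)
        s = step(s, policy(s))
    return states

def stagger_like(rounds, seed=0):
    """State-wise interactive loop: exactly one expert annotation per round."""
    rng = random.Random(seed)
    labels = {}  # lookup table standing in for the online learner (assumption)

    def policy(s):
        return labels.get(s, 0)  # default to action 0 on unlabeled states

    for _ in range(rounds):
        visited = rollout(policy)   # states come from the learner's own rollout
        s = rng.choice(visited)     # sample a single state from this trajectory
        labels[s] = expert(s)       # query the expert on that state only
    return labels

labels = stagger_like(rounds=200)
learned = lambda s: labels.get(s, 0)
mistakes = sum(learned(s) != expert(s) for s in rollout(learned))
print(labels, mistakes)
```

Because the queried states are drawn from the learner's own rollouts, the annotations concentrate on the states where the learner gets stuck, which is the intuition behind counting annotation cost per state rather than per trajectory.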

Paper Structure

This paper contains 54 sections, 31 theorems, 150 equations, 8 figures, 4 algorithms.

Key Result

Theorem 2

Suppose the realizability assumption holds. Then, with probability $1-\delta$, the policy $\hat{\pi}$ returned by BC satisfies:

Figures (8)

  • Figure 1: State-wise sample complexity comparison between Behavior Cloning and $\textsc{Stagger}$. Shaded areas show the 10th–90th percentile bootstrap confidence intervals (DiCiccio and Efron, 1996) over 10 runs. $\textsc{Stagger}$ matches or exceeds BC with $50\%$ fewer annotations, achieving better state-wise annotation efficiency.
  • Figure 2: MDP construction and simulation results of algorithms with rewards assigned only in $\mathbf{E}$. We evaluate $\textsc{Warm-Stagger}$ (WS) with 200, 800, 3200 offline (state, expert action) pairs. All methods are evaluated under equal total annotation cost with $C = 1$. With 800 offline (state, expert action) pairs, WS significantly improves the sample efficiency over the baselines and explores $\mathbf{E'}$ more effectively.
  • Figure 3: Sample and cost efficiency on MuJoCo tasks. The top row shows expected return vs. number of annotations ($C = 1$); the bottom row shows performance under a cost-aware setting ($C = 2$). $\textsc{Warm-Stagger}$ (WS) is initialized with 1/8, 1/4, or 1/2 of the total annotation budget as offline demonstrations. Specifically, WS($n$) refers to WS with offline expert trajectory demonstrations of total length $n$. For a good range of $n$'s, WS($n$) matches $\textsc{Stagger}$ in sample efficiency and outperforms the baselines when $C = 2$.
  • Figure 4: Sample and cost efficiency on MuJoCo tasks. The top row shows expected return vs. number of annotations ($C = 1$); the bottom row shows performance under a cost-aware setting ($C = 2$). $\textsc{Warm-Stagger}$ (WS) is initialized with 1/20, 1/10, or 1/5 of the samples as offline demonstrations. It matches $\textsc{Stagger}$ in sample efficiency and outperforms the baselines when $C = 2$, especially WS(1/5).
  • Figure 5: Performance comparison under MSE loss across MuJoCo tasks. Results show that $\textsc{Warm-Stagger}$ (WS) achieves comparable sample efficiency and performance to the log loss setting, with improved training stability. Each curve represents the average over 10 seeds.
  • ...and 3 more figures
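The cost-aware comparisons in the captions above can be illustrated with a small budget calculation. This sketch assumes that one offline (state, expert action) pair costs 1 unit and one interactive annotation costs $C$ units, so a hybrid run that spends a fraction of its budget offline can afford fewer interactive queries as $C$ grows. The function name and the example budget of 2000 are hypothetical.

```python
def split_budget(total_cost, offline_fraction, cost_ratio):
    """Split an annotation budget between offline pairs and interactive queries.

    Assumes an offline annotation costs 1 unit and an interactive annotation
    costs `cost_ratio` units (the C in the captions); both are illustrative.
    """
    n_offline = int(total_cost * offline_fraction)
    n_interactive = int((total_cost - n_offline) // cost_ratio)
    return n_offline, n_interactive

# With a quarter of the budget spent offline:
print(split_budget(2000, 0.25, 1))  # (500, 1500)
print(split_budget(2000, 0.25, 2))  # (500, 750)
```

Under this accounting, a purely interactive method with the same budget of 2000 would get 2000 queries at $C = 1$ but only 1000 at $C = 2$, which is why the offline share matters more in the cost-aware setting.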

Theorems & Definitions (41)

  • Definition 1: Each-step Mixing of $\mathcal{B}$
  • Theorem 2: Guarantee of BC (Foster et al., 2024)
  • Theorem 3
  • Definition 4: Each-step policy completion
  • Theorem 5
  • Remark 6
  • Theorem 7
  • Definition 8: Trajectory-wise $L_1$-divergence
  • Definition 9: State-wise Hellinger distance
  • Lemma 10
  • ...and 31 more