Non-Adversarial Imitation Learning Provably Free of Compounding Errors: The Role of Bellman Constraints

Tian Xu, Chenyang Wang, Xiaochen Zhai, Ziniu Li, Yi-Chen Li, Yang Yu

Abstract

Adversarial imitation learning (AIL) achieves high-quality imitation by mitigating compounding errors in behavioral cloning (BC), but often exhibits training instability due to adversarial optimization. To avoid this issue, a class of non-adversarial Q-based imitation learning (IL) methods, represented by IQ-Learn, has emerged and is widely believed to outperform BC by leveraging online environment interactions. However, this paper revisits IQ-Learn and demonstrates that it provably reduces to BC and suffers from an imitation gap lower bound with quadratic dependence on horizon, therefore still suffering from compounding errors. Theoretical analysis reveals that, despite using online interactions, IQ-Learn uniformly suppresses the Q-values for all actions on states uncovered by demonstrations, thereby failing to generalize. To address this limitation, we introduce a primal-dual framework for distribution matching, yielding a new Q-based IL method, Dual Q-DM. The key mechanism in Dual Q-DM is incorporating Bellman constraints to propagate high Q-values from visited states to unvisited ones, thereby achieving generalization beyond demonstrations. We prove that Dual Q-DM is equivalent to AIL and can recover expert actions beyond demonstrations, thereby mitigating compounding errors. To the best of our knowledge, Dual Q-DM is the first non-adversarial IL method that is theoretically guaranteed to eliminate compounding errors. Experimental results further corroborate our theoretical results.

Paper Structure

This paper contains 32 sections, 10 theorems, 131 equations, 3 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

Suppose that $\widehat{Q}$ is the optimal solution to IQ-Learn in eq:iq_learn and $\pi_{\widehat{Q}}$ is the derived policy in eq:softmax_pi. Assume the softmax policy class realizes the BC policy, i.e., $\pi^{\operatorname{BC}} \in \{ \pi_{Q}: Q \in {\mathcal{Q}} \}$. Then the following holds:

Figures (3)

  • Figure 1: Imitation gaps across different horizons (the curves of BC, IQ-Learn (TV), and ValueDICE overlap).
  • Figure 2: Learning curves over online environment interactions on 5 MuJoCo tasks. The $x$-axis is the number of environment interactions and the $y$-axis is the return.
  • Figure 3: An example of a TD MDP. Arrows denote the corresponding transitions.

Theorems & Definitions (25)

  • Theorem 1
  • Remark 1
  • Remark 2
  • Remark 3
  • Corollary 1
  • Remark 4
  • Remark 5
  • Theorem 2
  • Remark 6
  • Definition 1: TD MDPs
  • ...and 15 more