Non-Adversarial Imitation Learning Provably Free of Compounding Errors: The Role of Bellman Constraints

Tian Xu; Chenyang Wang; Xiaochen Zhai; Ziniu Li; Yi-Chen Li; Yang Yu

Abstract

Adversarial imitation learning (AIL) achieves high-quality imitation by mitigating the compounding errors of behavioral cloning (BC), but often exhibits training instability due to adversarial optimization. To avoid this issue, a class of non-adversarial Q-based imitation learning (IL) methods, represented by IQ-Learn, has emerged and is widely believed to outperform BC by leveraging online environment interactions. However, this paper revisits IQ-Learn and demonstrates that it provably reduces to BC and incurs an imitation gap lower bound with quadratic dependence on the horizon, and therefore still suffers from compounding errors. Our theoretical analysis reveals that, despite using online interactions, IQ-Learn uniformly suppresses the Q-values of all actions on states not covered by demonstrations, and thereby fails to generalize. To address this limitation, we introduce a primal-dual framework for distribution matching, yielding a new Q-based IL method, Dual Q-DM. The key mechanism in Dual Q-DM is incorporating Bellman constraints to propagate high Q-values from visited states to unvisited ones, thereby achieving generalization beyond demonstrations. We prove that Dual Q-DM is equivalent to AIL and can recover expert actions beyond demonstrations, thereby mitigating compounding errors. To the best of our knowledge, Dual Q-DM is the first non-adversarial IL method that is theoretically guaranteed to eliminate compounding errors. Experimental results further corroborate our theoretical findings.
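
As a rough illustration of why uniformly suppressed Q-values prevent generalization, the toy sketch below computes a softmax policy from the Q-values of a single state. It is our own minimal example under stated assumptions, not the paper's implementation: the function name `softmax_policy`, the `temperature` parameter, and the numeric Q-values are all hypothetical. When every action in a state has the same Q-value, however low, the softmax policy is uniform and carries no information about the expert action.

```python
import math

def softmax_policy(q_values, temperature=1.0):
    """Return action probabilities proportional to exp(Q(s, a) / temperature).

    `q_values` holds the Q-values of all actions in one state. The
    `temperature` parameter is a hypothetical knob for this sketch.
    """
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical Q-values for two states with three actions each.
# Covered state: the demonstrated action (index 0) has the highest Q-value.
covered = softmax_policy([2.0, 0.0, 0.0])
# Uncovered state: all actions are suppressed to the same low value.
uncovered = softmax_policy([-5.0, -5.0, -5.0])

print(covered)    # action 0 receives the most probability mass
print(uncovered)  # every action receives equal probability
```

In this toy setting, only a mechanism that raises the Q-value of one action relative to the others in the uncovered state, such as the propagation through Bellman constraints described in the abstract, could make the derived policy prefer that action.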

Paper Structure (32 sections, 10 theorems, 131 equations, 3 figures, 8 tables, 2 algorithms)

Introduction
Our Contribution
Preliminaries
Revisiting Inverse Soft Q-Learning
An Introduction to Inverse Soft Q-Learning
Inverse Soft Q-Learning Reduces to Behavioral Cloning
Truly Q-based Distribution Matching with Bellman Constraints
Primal-Dual Framework for Distribution Matching
Bellman Constraints Enable Generalization
Related Work
Experimental Validation
Experimental Set-up
Experimental Results
Conclusion
Additional Related Work
...and 17 more sections

Key Result

Theorem 1

Suppose that $\widehat{Q}$ is the optimal solution to the IQ-Learn objective (eq:iq_learn) and $\pi_{\widehat{Q}}$ is the softmax policy derived from it (eq:softmax_pi). Assume the softmax policy class realizes the BC policy, i.e., $\pi^{\operatorname{BC}} \in \{ \pi_{Q}: Q \in {\mathcal{Q}} \}$. Then the following holds:

Figures (3)

Figure 1: Imitation gaps across different horizons (the curves of BC, IQ-Learn (TV), and ValueDICE overlap).
Figure 2: Learning curves versus the number of online environment interactions on 5 MuJoCo tasks. The $x$-axis is the number of environment interactions and the $y$-axis is the return.
Figure 3: An example of a TD MDP. Arrows denote transitions.

Theorems & Definitions (25)

Theorem 1
Remark 1
Remark 2
Remark 3
Corollary 1
Remark 4
Remark 5
Theorem 2
Remark 6
Definition 1: TD MDPs
...and 15 more
