Inverse Reinforcement Learning without Reinforcement Learning

Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, Zhiwei Steven Wu

TL;DR

The paper tackles the inefficiency of traditional IRL, where inner-loop RL solves dominate computation, by introducing expert resets that exploit the expert state distribution. It presents two exponential-speedup reductions, MMDP and NRMM, with polynomial sample complexity, and a meta-algorithm FILTER that blends resets with standard exploration to balance horizon errors. Theoretical results show improved sample complexity and error bounds, while experiments on continuous-control benchmarks demonstrate faster and more robust imitation learning. The work offers a reduction-based framework for faster IRL and suggests broadly applicable directions for leveraging expert demonstrations across problems.

Abstract

Inverse Reinforcement Learning (IRL) is a powerful set of techniques for imitation learning that aims to learn a reward function that rationalizes expert demonstrations. Unfortunately, traditional IRL methods suffer from a computational weakness: they require repeatedly solving a hard reinforcement learning (RL) problem as a subroutine. This is counter-intuitive from the viewpoint of reductions: we have reduced the easier problem of imitation learning to repeatedly solving the harder problem of RL. Another thread of work has proved that access to the side-information of the distribution of states where a strong policy spends time can dramatically reduce the sample and computational complexities of solving an RL problem. In this work, we demonstrate for the first time a more informed imitation learning reduction where we utilize the state distribution of the expert to alleviate the global exploration component of the RL subroutine, providing an exponential speedup in theory. In practice, we find that we are able to significantly speed up the prior art on continuous control tasks.
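
The role of expert resets can be illustrated with a small sketch. This is a hedged illustration rather than the authors' implementation: the function name `choose_start`, the mixing parameter `alpha`, the uniform sampling over demonstrations and timesteps, and the use of each demonstration's first state as a stand-in for the environment's initial-state distribution are all assumptions made for the example.

```python
import random

def choose_start(expert_trajs, alpha, rng):
    """Pick where a learner rollout begins (hypothetical helper).

    With probability alpha: reset to a state the expert visited, at a
    uniformly random timestep of a uniformly random demonstration.
    Otherwise: start at timestep 0 from the first state of a demonstration,
    which stands in here for the environment's initial-state distribution.
    Returns (timestep, state).
    """
    traj = rng.choice(expert_trajs)
    if rng.random() < alpha:
        t = rng.randrange(len(traj))
        return t, traj[t]
    return 0, traj[0]

# Two toy demonstrations; each entry is the state at that timestep.
demos = [[10, 11, 12, 13], [20, 21, 22, 23]]
t, state = choose_start(demos, alpha=0.5, rng=random.Random(0))
```

Setting `alpha` to 0 recovers ordinary rollouts from the start, while setting it to 1 always resets to expert states; intermediate values blend the two, in the spirit of the interpolation the TL;DR attributes to FILTER.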

Paper Structure

This paper contains 27 sections, 13 theorems, 54 equations, 4 figures, 1 table, and 2 algorithms.

Key Result

Theorem 3.1

Inverse RL Sample Complexity: For the dual and primal IRL algorithms (alg:dual-irl and alg:primal-irl), there exist an MDP, $\pi_E$, $\Pi$, and $\mathcal{F}_r$ such that returning a policy $\pi$ which satisfies $J(\pi_E, r) - J(\pi, r) \leq 0.5 V_{max}$ requires $\Omega(|\mathcal{A}|^T)$ interactions with the environment, w
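
To give a feel for the scale of the $\Omega(|\mathcal{A}|^T)$ term, the sketch below counts the distinct length-$T$ action sequences for a few horizons. It is only an arithmetic illustration: the function name is hypothetical, the choice of three actions is an example, and the $\Omega$ notation hides constants that this count does not capture.

```python
def num_action_sequences(num_actions: int, horizon: int) -> int:
    """Number of distinct action sequences of length `horizon`: |A|^T."""
    return num_actions ** horizon

# Example values only: three actions, increasing horizons.
counts = {T: num_action_sequences(3, T) for T in (5, 10, 20)}
for T, n in counts.items():
    print(f"T={T:2d}: {n} sequences")
```

With three actions, the count grows from 243 at $T = 5$ to 59049 at $T = 10$ and 3486784401 at $T = 20$, which is the kind of growth a polynomial-sample-complexity method avoids.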

Figures (4)

  • Figure 1: Traditional Inverse RL methods (left) repeatedly solve RL problems with adversarially chosen rewards in their inner loop, which can be computationally expensive. We introduce two exponentially faster methods for IRL. NRMM (No-Regret Moment Matching, center) resets the learner to states from the expert demonstrations before comparing trajectory suffixes. MMDP (Moment Matching by Dynamic Programming, right) optimizes a sequence of policies backwards in time. Both methods avoid solving the global exploration problem inherent in RL.
  • Figure 2: dante: A three-row MDP where at each timestep, the learner can move up, move down, or stay in the same row. The expert always stays in the center row. The goal is to stay in the top two rows.
  • Figure 3: Both FILTER(BR) and FILTER(NR) outperform standard MM and BC on 4 of the 5 environments considered. Standard errors are computed across 10 seeds.
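
The three-row MDP of Figure 2 and the reset-then-compare-suffixes idea of Figure 1 can be sketched together in a few lines. This is a toy illustration, not the authors' code: the deterministic dynamics, the clipping at the top and bottom rows, the row indexing, the start in the center row, and every function name below are assumptions. The caption specifies only the three actions, the expert's behavior, and the goal.

```python
# Toy version of the three-row MDP in Figure 2 (details are assumptions).
TOP, CENTER, BOTTOM = 0, 1, 2
UP, STAY, DOWN = -1, 0, 1

def step(row, action):
    """Assumed deterministic dynamics: move one row, clipped to the grid."""
    return max(TOP, min(BOTTOM, row + action))

def expert_policy(row, t):
    """The expert never changes rows, so it remains in the center row."""
    return STAY

def rollout(policy, start_row, start_t, horizon):
    """Rows visited from timestep start_t through horizon - 1."""
    rows = [start_row]
    for t in range(start_t, horizon - 1):
        rows.append(step(rows[-1], policy(rows[-1], t)))
    return rows

def goal_fraction(rows):
    """Fraction of visited rows inside the goal region (top two rows)."""
    return sum(r in (TOP, CENTER) for r in rows) / len(rows)

def suffix_from_expert_reset(policy, expert_rows, t, horizon):
    """Reset to the expert's state at time t, then run `policy` to the end."""
    return rollout(policy, expert_rows[t], t, horizon)

horizon = 6
expert_rows = rollout(expert_policy, CENTER, 0, horizon)

# A deliberately bad learner that always moves down.
always_down = lambda row, t: DOWN
learner_suffix = suffix_from_expert_reset(always_down, expert_rows, 3, horizon)
expert_suffix = expert_rows[3:]

print(goal_fraction(expert_suffix), goal_fraction(learner_suffix))
```

Because the learner starts from a state the expert actually visited, each comparison needs only a rollout of the remaining `horizon - t` steps, which is one way to read the caption's claim that resets sidestep global exploration.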

Theorems & Definitions (27)

  • Theorem 3.1
  • Lemma 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Lemma 3.5
  • Theorem 3.6
  • Theorem 3.7
  • Theorem 3.8
  • Corollary 4.1
  • Lemma 1.1
  • ...and 17 more