Hybrid Inverse Reinforcement Learning

Juntao Ren, Gokul Swamy, Zhiwei Steven Wu, J. Andrew Bagnell, Sanjiban Choudhury

TL;DR

This work addresses the sample-inefficiency of inverse reinforcement learning by introducing Hybrid Inverse Reinforcement Learning, a reduction to expert-competitive RL that uses expert data within the policy search. It presents two algorithms, HyPE (model-free) and HyPER (model-based), which leverage a mixture of expert and learner data to dramatically reduce inner-loop exploration while maintaining performance guarantees. The authors formalize Expert-Relative Regret Oracles (ERROr) and demonstrate both theoretical guarantees and empirical gains on continuous-control benchmarks, including MuJoCo and D4RL antmaze. The approach offers flexible trade-offs depending on environment access and model availability, providing a practical path to more sample-efficient imitation learning in robotics and related domains.

Abstract

The inverse reinforcement learning approach to imitation learning is a double-edged sword. On the one hand, it can enable learning from a smaller number of expert demonstrations with more robustness to error compounding than behavioral cloning approaches. On the other hand, it requires that the learner repeatedly solve a computationally expensive reinforcement learning (RL) problem. Often, much of this computation is wasted searching over policies very dissimilar to the expert's. In this work, we propose using hybrid RL -- training on a mixture of online and expert data -- to curtail unnecessary exploration. Intuitively, the expert data focuses the learner on good states during training, which reduces the amount of exploration required to compute a strong policy. Notably, such an approach doesn't need the ability to reset the learner to arbitrary states in the environment, a requirement of prior work in efficient inverse RL. More formally, we derive a reduction from inverse RL to expert-competitive RL (rather than globally optimal RL) that allows us to dramatically reduce interaction during the inner policy search loop while maintaining the benefits of the IRL approach. This allows us to derive both model-free and model-based hybrid inverse RL algorithms with strong policy performance guarantees. Empirically, we find that our approaches are significantly more sample efficient than standard inverse RL and several other baselines on a suite of continuous control tasks.
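
The mixing of online and expert data described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_hybrid_batch`, the `expert_fraction` parameter, and its default of 0.5 are assumptions made for illustration, and transitions are treated as opaque Python objects.

```python
import random


def sample_hybrid_batch(expert_data, learner_data, batch_size,
                        expert_fraction=0.5, seed=None):
    """Draw one training batch that mixes expert and learner transitions.

    expert_fraction is a hypothetical knob (not a value from the paper).
    Returns a shuffled list of (transition, is_expert) pairs.
    """
    rng = random.Random(seed)
    n_expert = round(batch_size * expert_fraction)
    n_learner = batch_size - n_expert

    # Sample with replacement from each source, tagging where each item came from.
    batch = [(t, True) for t in rng.choices(expert_data, k=n_expert)]
    batch += [(t, False) for t in rng.choices(learner_data, k=n_learner)]

    # Shuffle so downstream updates do not see the two sources in blocks.
    rng.shuffle(batch)
    return batch


# Toy usage: strings stand in for (state, action, next_state) transitions.
expert = ["e0", "e1", "e2"]
learner = ["l0", "l1", "l2", "l3"]
batch = sample_hybrid_batch(expert, learner, batch_size=8, seed=0)
```

In a sketch like this, a policy-search step would consume `batch` in place of a purely on-policy batch, so every update sees some transitions from states the expert visits.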

Paper Structure

This paper contains 20 sections, 7 theorems, 28 equations, 8 figures, 7 tables, 4 algorithms.

Key Result

Theorem 3.3

Assume access to an $\mathbb{A}_{\pi}$ and an $\mathbb{A}_f$ that satisfy Definitions 3.1 and 3.2, respectively. Set $\pi_{t+1} = \mathbb{A}_{\pi}(f_{1:t})$ and $f_{t+1} = \mathbb{A}_f(\ell_{1:t})$. Then, $\bar{\pi}$ (the mixture of $\pi_{1:T}$) satisfies

Figures (8)

  • Figure 1: Standard inverse reinforcement learning algorithms (left) require repeatedly solving a reinforcement learning problem in their inner loop. Thus, the learner is potentially forced to explore the entire state space to find any reward. We introduce hybrid inverse reinforcement learning, where the learner trains on a mixture of its own and the expert's data during the policy search inner loop. This reduces the exploration burden on the learner by providing positive examples. We provide model-free and model-based algorithms that are both significantly more sample efficient than standard inverse RL approaches on continuous control tasks.
  • Figure 2: Difference in rewards between the learner policy $\pi_i$ and expert policy $\pi_E$ under the discriminator function $f_i$ for the first 100k environment interactions in primal IRL.
  • Figure 3: Consider a binary tree MDP. Define $\Pi$ to be the set of all deterministic policies (paths through the tree), and $\mathcal{F}_r$ the class of rewards that always assign $+1$ to the bottom-left node and an additional $+1$ to any one of the three other leaf nodes. The expert (the green path) always takes the leftmost path. Note that the expert is not optimal under any $f \in \mathcal{F}_r$. In the first image, the learner (the orange path) has computed the best response to $f_1$ (the labels on the nodes). To penalize the learner, $f_2$ shifts the reward to a neighboring leaf node. As a result, $\pi_2$ must search through the entire tree to compute the best-response. Beyond the repeated exploration required to compute a best-response, the best responses are different across iterations, which leads to instability in policy training.
  • Figure 4: We see HyPER and HyPE achieve the highest reward on the MuJoCo locomotion benchmark. Further, the performance gap increases with the difficulty of the environment (i.e. how far right a plot is in the above figure). We run all model-free algorithms for 1 million environment steps. Due to the higher interaction efficiency of model-based approaches, we only run HyPER for 150k environment steps, after which the last reward is extended horizontally across. We compute standard error across 5 seeds for HyPER, and across 10 seeds for all other algorithms.
  • Figure 5: Results on D4RL antmaze-large environment. All interactive baselines achieve 0 reward. While HyPE outperforms prior interactive methods, it does require resets in the environment to beat BC. HyPER is able to surpass BC without needing to reset to expert states and match BC performance with roughly 1/10th the amount of environment interaction that HyPE + Resets requires. Standard errors are reported across 5 seeds for all algorithms.
  • ...and 3 more figures

Theorems & Definitions (15)

  • Definition 3.1: $\mathsf{ERROr}\{\mathsf{Reg}_{\pi}(T)\}$
  • Definition 3.2
  • Theorem 3.3
  • Proof
  • Corollary 3.4: HyPE Performance Bound
  • Proof
  • Lemma 3.5
  • Corollary 3.6: HyPER Performance Bound
  • Proof
  • Lemma 1.1: Policy Evaluation Lemma, xie2020q
  • ...and 5 more
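
The alternating updates in the statement of Theorem 3.3, $\pi_{t+1} = \mathbb{A}_{\pi}(f_{1:t})$ and $f_{t+1} = \mathbb{A}_f(\ell_{1:t})$, can be sketched as a simple loop. This is a bookkeeping sketch only, under stated assumptions: `policy_oracle` and `reward_oracle` are hypothetical callables standing in for $\mathbb{A}_{\pi}$ and $\mathbb{A}_f$, `loss_of` is a hypothetical helper that builds $\ell_t$ from $\pi_t$ (this summary does not define $\ell_t$), and sampling uniformly from $\pi_{1:T}$ is an assumed reading of "the mixture".

```python
import random


def run_reduction(policy_oracle, reward_oracle, loss_of, pi_1, f_1, T):
    """Run T rounds of the alternating updates and return pi_1, ..., pi_T.

    policy_oracle: maps the reward history f_{1:t} to pi_{t+1}  (stand-in for A_pi)
    reward_oracle: maps the loss history ell_{1:t} to f_{t+1}   (stand-in for A_f)
    loss_of:       maps pi_t to ell_t (hypothetical; not defined in this summary)
    """
    policies, rewards, losses = [pi_1], [f_1], []
    for _ in range(1, T):
        losses.append(loss_of(policies[-1]))    # ell_t, built from pi_t
        next_pi = policy_oracle(list(rewards))  # pi_{t+1} = A_pi(f_{1:t})
        next_f = reward_oracle(list(losses))    # f_{t+1} = A_f(ell_{1:t})
        policies.append(next_pi)
        rewards.append(next_f)
    return policies


def sample_from_mixture(policies, rng):
    """Pick one policy uniformly at random (assumed form of the mixture)."""
    return rng.choice(policies)


# Toy stand-ins that only record history lengths, to check the indexing.
policies = run_reduction(
    policy_oracle=lambda fs: len(fs),
    reward_oracle=lambda ls: len(ls),
    loss_of=lambda pi: pi,
    pi_1=0, f_1=0, T=4,
)
chosen = sample_from_mixture(policies, random.Random(0))
```

The toy oracles carry no learning content; they only confirm that round $t$ passes exactly $t$ rewards to the policy oracle and $t$ losses to the reward oracle.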