Hybrid Inverse Reinforcement Learning
Juntao Ren, Gokul Swamy, Zhiwei Steven Wu, J. Andrew Bagnell, Sanjiban Choudhury
TL;DR
This work addresses the sample-inefficiency of inverse reinforcement learning by introducing Hybrid Inverse Reinforcement Learning, a reduction to expert-competitive RL that uses expert data within the policy search. It presents two algorithms, HyPE (model-free) and HyPER (model-based), which leverage a mixture of expert and learner data to dramatically reduce inner-loop exploration while maintaining performance guarantees. The authors formalize Expert-Relative Regret Oracles (ERROr) and demonstrate both theoretical guarantees and empirical gains on continuous-control benchmarks, including MuJoCo and D4RL antmaze. The approach offers flexible trade-offs depending on environment access and model availability, providing a practical path to more sample-efficient imitation learning in robotics and related domains.
Abstract
The inverse reinforcement learning approach to imitation learning is a double-edged sword. On the one hand, it can enable learning from a smaller number of expert demonstrations with more robustness to error compounding than behavioral cloning approaches. On the other hand, it requires that the learner repeatedly solve a computationally expensive reinforcement learning (RL) problem. Often, much of this computation is wasted searching over policies very dissimilar to the expert's. In this work, we propose using hybrid RL -- training on a mixture of online and expert data -- to curtail unnecessary exploration. Intuitively, the expert data focuses the learner on good states during training, which reduces the amount of exploration required to compute a strong policy. Notably, such an approach doesn't need the ability to reset the learner to arbitrary states in the environment, a requirement of prior work in efficient inverse RL. More formally, we derive a reduction from inverse RL to expert-competitive RL (rather than globally optimal RL) that allows us to dramatically reduce interaction during the inner policy search loop while maintaining the benefits of the IRL approach. This allows us to derive both model-free and model-based hybrid inverse RL algorithms with strong policy performance guarantees. Empirically, we find that our approaches are significantly more sample efficient than standard inverse RL and several other baselines on a suite of continuous control tasks.
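As a rough illustration of the hybrid training idea described above, the following minimal sketch draws each minibatch partly from expert demonstrations and partly from the learner's own online data. It is not the authors' implementation: the function name `sample_hybrid_batch`, the list-of-transitions data layout, and the default 50/50 mixing ratio are all assumptions made for this example.

```python
import random


def sample_hybrid_batch(expert_data, learner_data, batch_size,
                        expert_fraction=0.5, rng=None):
    """Draw a minibatch that mixes expert and learner transitions.

    expert_data, learner_data: non-empty lists of transitions
        (hypothetical layout; any per-transition object works).
    expert_fraction: share of the batch taken from expert data
        (0.5 is an assumed default, not a value from the paper).
    Returns a shuffled list of (transition, is_expert) pairs.
    """
    if rng is None:
        rng = random.Random()

    n_expert = round(batch_size * expert_fraction)
    n_learner = batch_size - n_expert

    # Sample with replacement so a small expert dataset can still
    # fill its share of every batch.
    expert_part = [(t, True) for t in rng.choices(expert_data, k=n_expert)]
    learner_part = [(t, False) for t in rng.choices(learner_data, k=n_learner)]

    batch = expert_part + learner_part
    rng.shuffle(batch)  # avoid a fixed expert-then-learner ordering
    return batch


# Usage: 5 expert transitions, 50 learner transitions, batch of 64.
expert = [("e", i) for i in range(5)]
learner = [("l", i) for i in range(50)]
batch = sample_hybrid_batch(expert, learner, 64, rng=random.Random(0))
```

Because the expert share of every batch is fixed, the policy update always sees states the expert visited, even early in training when the learner's own rollouts rarely reach them. This is one simple way to realize the intuition that expert data focuses the learner on good states.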
