On the Effective Horizon of Inverse Reinforcement Learning
Yiqing Xu, Finale Doshi-Velez, David Hsu
TL;DR
This work investigates the effective horizon in inverse reinforcement learning (IRL) and shows that a horizon shorter than the ground-truth horizon can yield better generalization when expert data are scarce. A formal analysis links the horizon to the complexity of the induced policy class and decomposes the learning error into reward-estimation and horizon-mismatch components, yielding a bound that favors an intermediate horizon. The authors propose learning the reward and the effective horizon jointly, and validate the approach by extending LP-IRL and MaxEnt-IRL with cross-validation on four tasks: the optimal horizon is below the ground-truth horizon, and cross-validation identifies it effectively. The results offer a principled way to treat the horizon as a learned quantity in IRL, with practical implications for data-efficient reward learning and policy optimization. Code and a project page are publicly available.
Abstract
Inverse reinforcement learning (IRL) algorithms often rely on (forward) reinforcement learning or planning, over a given time horizon, to compute an approximately optimal policy for a hypothesized reward function; they then match this policy with expert demonstrations. The time horizon plays a critical role in determining both the accuracy of reward estimates and the computational efficiency of IRL algorithms. Interestingly, an *effective time horizon* shorter than the ground-truth value often produces better results faster. This work formally analyzes this phenomenon and provides an explanation: the time horizon controls the complexity of an induced policy class, and a shorter horizon mitigates overfitting with limited data. This analysis provides a guide for the principled choice of the effective horizon for IRL. It also prompts us to re-examine the classic IRL formulation: it is more natural to jointly learn the reward and the effective horizon than to learn the reward alone with a given horizon. To validate our findings, we implement a cross-validation extension, and the experimental results support the theoretical analysis. The project page and code are publicly available.
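
To make the role of the horizon concrete, the following is a minimal sketch and not the authors' implementation. It assumes a small tabular MDP with known transitions, an undiscounted finite planning horizon standing in for the effective horizon, and a fixed hypothesized reward, so it omits the reward-learning step that the paper performs jointly with horizon selection. The names `plan_finite_horizon`, `select_horizon`, and `toy_chain` are hypothetical and introduced only for illustration.

```python
import numpy as np


def plan_finite_horizon(P, r, horizon):
    """Greedy first-step policy from undiscounted finite-horizon value iteration.

    P: transition probabilities, shape (A, S, S), with P[a, s, t] = Pr(t | s, a).
    r: hypothesized state rewards, shape (S,).
    horizon: number of planning steps (assumed >= 1).
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    v = np.zeros(P.shape[1])
    for _ in range(horizon):
        # q[s, a] = r[s] + expected value of the next state under action a
        q = r[:, None] + np.einsum("ast,t->sa", P, v)
        v = q.max(axis=1)
    return q.argmax(axis=1)


def select_horizon(P, r, heldout, horizons):
    """Pick the horizon whose planned policy best matches held-out expert data.

    heldout: integer array of (state, action) pairs, shape (N, 2).
    Ties go to the first horizon listed in `horizons`.
    """
    states, actions = heldout[:, 0], heldout[:, 1]
    scores = {
        T: float(np.mean(plan_finite_horizon(P, r, T)[states] == actions))
        for T in horizons
    }
    best = max(scores, key=scores.get)
    return best, scores


def toy_chain():
    """Hypothetical 4-state chain: action 0 moves left, action 1 moves right."""
    n = 4
    P = np.zeros((2, n, n))
    for s in range(n):
        P[0, s, max(s - 1, 0)] = 1.0
        P[1, s, min(s + 1, n - 1)] = 1.0
    r = np.array([0.3, 0.0, 0.0, 1.0])  # small reward at one end, large at the other
    return P, r


if __name__ == "__main__":
    P, r = toy_chain()
    # Held-out expert pairs (state, action); made up for this illustration.
    heldout = np.array([[0, 0], [1, 1], [2, 1], [3, 1]])
    print(select_horizon(P, r, heldout, horizons=[1, 2, 3, 4]))
```

On this toy chain the planned policy changes as the horizon grows, and the held-out agreement score peaks at an intermediate horizon rather than the longest one. In a fuller pipeline, each candidate horizon would also get its own reward estimate from the training demonstrations before being scored on the held-out ones.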
