On the Effective Horizon of Inverse Reinforcement Learning

Yiqing Xu; Finale Doshi-Velez; David Hsu

TL;DR

This work investigates the effective horizon in inverse reinforcement learning (IRL) and shows that a horizon shorter than the ground-truth horizon can yield better generalization when expert data are scarce. A formal analysis links the horizon to policy-class complexity and decomposes the learning error into reward-estimation and horizon-mismatch components, yielding a bound that favors an intermediate horizon. The authors propose learning the reward and the effective horizon jointly, and validate the approach by extending LP-IRL and MaxEnt-IRL with cross-validation on four tasks: the optimal horizon falls below the ground-truth value, and cross-validation identifies it effectively. These results support a principled adjust-and-learn paradigm for IRL, with practical implications for data-efficient reward learning and policy optimization. The code and project page are publicly available.

Abstract

Inverse reinforcement learning (IRL) algorithms often rely on (forward) reinforcement learning or planning, over a given time horizon, to compute an approximately optimal policy for a hypothesized reward function; they then match this policy with expert demonstrations. The time horizon plays a critical role in determining both the accuracy of reward estimates and the computational efficiency of IRL algorithms. Interestingly, an *effective time horizon* shorter than the ground-truth value often produces better results faster. This work formally analyzes this phenomenon and provides an explanation: the time horizon controls the complexity of an induced policy class and mitigates overfitting with limited data. This analysis provides a guide for the principled choice of the effective horizon for IRL. It also prompts us to re-examine the classic IRL formulation: it is more natural to learn jointly the reward and the effective horizon rather than the reward alone with a given horizon. To validate our findings, we implement a cross-validation extension and the experimental results support the theoretical analysis. The project page and code are publicly available.
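As a rough illustration of treating the discount factor as something to be selected rather than fixed, the sketch below picks a candidate $\widehat{\gamma}$ by k-fold cross-validation on held-out expert state-action pairs. It is a minimal sketch under stated assumptions, not the paper's algorithm: the transition model is assumed known, `count_reward` is a hypothetical stand-in for the reward-fitting step (it is neither LP-IRL nor MaxEnt-IRL), and all function names and the toy chain MDP are invented for this example.

```python
import random

def greedy_policy(P, R, gamma, iters=200):
    """Value iteration; P[s][a] maps next state -> probability, R[s][a] is reward."""
    n_s, n_a = len(R), len(R[0])

    def q(V, s, a):
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

    V = [0.0] * n_s
    for _ in range(iters):
        V = [max(q(V, s, a) for a in range(n_a)) for s in range(n_s)]
    # Ties are broken toward the lowest action index.
    return [max(range(n_a), key=lambda a: q(V, s, a)) for s in range(n_s)]

def count_reward(demos, n_s, n_a):
    """Hypothetical stand-in for an IRL reward fit: empirical (s, a) frequency."""
    R = [[0.0] * n_a for _ in range(n_s)]
    for s, a in demos:
        R[s][a] += 1.0 / len(demos)
    return R

def select_gamma(P, demos, gammas, n_a, k=3, seed=0):
    """Return the candidate discount with the lowest held-out action error."""
    shuffled = list(demos)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = {}
    for g in gammas:
        wrong = 0
        for i, held_out in enumerate(folds):
            # Fit on every fold except the held-out one.
            train = [d for j, f in enumerate(folds) if j != i for d in f]
            policy = greedy_policy(P, count_reward(train, len(P), n_a), g)
            wrong += sum(policy[s] != a for s, a in held_out)
        scores[g] = wrong / len(shuffled)
    return min(gammas, key=lambda g: scores[g]), scores

# Toy 4-state chain: action 0 moves left, action 1 moves right.
P = [[{max(s - 1, 0): 1.0}, {min(s + 1, 3): 1.0}] for s in range(4)]
demos = [(s, 1) for s in range(4)] * 3  # the expert always moves right
best, scores = select_gamma(P, demos, [0.0, 0.5, 0.9, 0.99], n_a=2)
```

Here `scores` maps each candidate discount factor to its fraction of mismatched held-out actions, and `best` is the first candidate attaining the minimum. Any reward-learning routine with the same interface could replace `count_reward`.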

Paper Structure (44 sections, 9 theorems, 51 equations, 6 figures, 2 algorithms)

Introduction
Related works
Effective Horizon of Imitation Learning
Theoretical Analysis on Effective Horizon
Problem formulation
Analysis
Overview
Feasible Reward Function Set
Reward Function Estimation Error from Expert Policy Estimation Error
Expert Policy Estimation Error from Limited Data
Expected Value of the Expert Policy Estimation Error
Applying McDiarmid's Inequality
Uniform Bound on the Expert Policy Estimation Error
Policy Class Complexity Increases with $\widehat{\gamma}$
Error Decomposition and Deriving the Overall Bound
...and 29 more sections

Key Result

Theorem 4.1

Let $(S, A, P)$ be a controlled Markov process shared by two MDPs: the ground-truth MDP $(S, A, P, R_0, \gamma_{0})$ with reward function $R_0: S \times A \rightarrow [0, R_{\max}]$ and discount factor $\gamma_{0} \in (0, 1)$; and the estimated MDP $(S, A, P, \widehat{R}, \widehat{\gamma})$ with rew

Figures (6)

Figure 1: Summary of LP-IRL with varying discount factors across four tasks. The error count is the number of states in which a policy's selected action deviates from the expert's action. Each task displays the ground-truth value function (column 1), reward function (column 2), expert policy (column 3), error count curves for different amounts of expert data in a single instance (columns 4-8), and the error count curve summary for a batch of 10 MDPs across varying amounts of expert data (column 9). In all four tasks, $\gamma_{0} = 0.99$. The optimal discount factor satisfies $\widehat{\gamma}^*<\gamma_{0}$ across the amounts of expert data tested. MaxEnt-IRL has similar curves in Figure 4.
Figure 2: Optimal $\widehat{\gamma}^*$ for LP-IRL at varying amounts of expert data. For each task, we select $\widehat{\gamma}^*$ for all 10 sampled environments through cross-validation. The orange curves illustrate how the optimal discount factor $\widehat{\gamma}^*$ changes with the amount of expert data, while the green curves show the corresponding error counts. The ground-truth $\gamma_{0} = 0.99$ is depicted in grey, with its error counts displayed in blue. As the amount of expert data increases, $\widehat{\gamma}^*$ initially decreases sharply and then gradually increases, indicating that overfitting is prominent when expert data is scarce.
Figure 3: The cross-validation results for LP-IRL on four tasks are shown. The $x$-axis represents the amount of expert data; the $y$-axis shows policy error count differences. We compare discount factors $\widehat{\gamma}^*_{\text{cv}}$ (learned from cross-validation) and $\widehat{\gamma}^*_{\text{oracle}}$ (chosen by the oracle). Orange dots depict error differences between policies induced by $\widehat{\gamma}^*_{\text{cv}}$ and $\widehat{\gamma}^*_{\text{oracle}}$; blue dots show differences between policies induced by $\widehat{\gamma}^*_{\text{cv}}$ and the ground-truth $\gamma_{0}$. The orange curves near zero indicate that cross-validation effectively selects $\widehat{\gamma}^*$, while the positive blue curves show that cross-validation consistently yields better policies than using $\gamma_{0}$.
Figure 4: Summary of MaxEnt-IRL with different horizons for the four tasks. For each task, we present the ground-truth value function (column 1), ground-truth reward function (column 2), expert policy (column 3), error count curves in all states for different amounts of expert data for a single instance of the task (columns 4-8), and finally, a summary of the error curves for a batch of 10 MDPs (column 9). In all four tasks, the ground-truth horizon $T_{0} = 20$. The optimal horizon satisfies $\widehat{T}^*<T_{0}$ across the amounts of expert data tested.
Figure 5: Optimal horizons ($\widehat{T}^*$) for MaxEnt-IRL at varying amounts of expert data. We select the optimal $\widehat{T}^*$ for each of 10 sampled task environments using the cross-validation algorithm, based on the amount of expert data. Orange curves show how $\widehat{T}^*$ changes with the amount of expert data, while green curves display corresponding error counts. The ground-truth $T_{0} = 20$ is depicted by a grey line, with corresponding error lines in blue. The trends are consistent with LP-IRL.
...and 1 more figure
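
The error count used throughout these captions (the number of states where a policy's action differs from the expert's) can be written in a few lines. This is an illustrative sketch that assumes deterministic policies stored as lists indexed by state; the function name `error_count` is hypothetical and not taken from the paper's code.

```python
def error_count(policy, expert_policy):
    """Number of states where the learned action differs from the expert's."""
    if len(policy) != len(expert_policy):
        raise ValueError("policies must cover the same states")
    return sum(a != b for a, b in zip(policy, expert_policy))

# Toy example: the learned policy disagrees with the expert in states 2 and 3.
expert = [1, 1, 0, 1]
learned = [1, 1, 1, 0]
mismatches = error_count(learned, expert)
```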

Theorems & Definitions (13)

Definition 3.1: Policy Class and Complexity Measure
Theorem 4.1
Definition 4.2: IRL Problem, adapted to the setting of varying discount factors
Lemma 4.3: Feasible Reward Function Set, extended from pmlr-v139-metelli21a
Theorem 4.4: Extension of Theorem 3.1 in pmlr-v139-metelli21a
Theorem 4.5
Theorem 4.6: Value Function Difference Bound
Definition A.1: Reward and Policy Equivalence
Lemma A.2: Potential-based Reward Shaping ng1999policy
Remark
...and 3 more
