On the Effective Horizon of Inverse Reinforcement Learning

Yiqing Xu, Finale Doshi-Velez, David Hsu

TL;DR

This work investigates the effective horizon in inverse reinforcement learning (IRL), showing that a horizon shorter than the ground-truth horizon can yield better generalization when expert data are scarce. It provides a formal analysis linking the horizon to policy-class complexity and decomposes the learning error into reward-estimation and horizon-mismatch components, deriving a bound that favors an intermediate horizon. The authors propose learning the reward and the effective horizon jointly, and validate the approach by extending LP-IRL and MaxEnt-IRL with cross-validation on four tasks. In these experiments the optimal horizon lies below the ground-truth value and is identified effectively by cross-validation. The results have practical implications for data-efficient reward learning and policy optimization. The code and project page are publicly available.

Abstract

Inverse reinforcement learning (IRL) algorithms often rely on (forward) reinforcement learning or planning, over a given time horizon, to compute an approximately optimal policy for a hypothesized reward function; they then match this policy with expert demonstrations. The time horizon plays a critical role in determining both the accuracy of reward estimates and the computational efficiency of IRL algorithms. Interestingly, an *effective time horizon* shorter than the ground-truth value often produces better results faster. This work formally analyzes this phenomenon and provides an explanation: the time horizon controls the complexity of an induced policy class and mitigates overfitting with limited data. This analysis provides a guide for the principled choice of the effective horizon for IRL. It also prompts us to re-examine the classic IRL formulation: it is more natural to learn jointly the reward and the effective horizon rather than the reward alone with a given horizon. To validate our findings, we implement a cross-validation extension and the experimental results support the theoretical analysis. The project page and code are publicly available.
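The cross-validation idea mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `select_horizon_by_cv`, `mismatch_rate`, and the `fit_policy` callable are hypothetical names, and `fit_policy` stands in for any IRL routine (for example LP-IRL or MaxEnt-IRL) that, given training demonstrations and a candidate horizon or discount factor, returns a policy.

```python
import random
from typing import Callable, Dict, Hashable, Sequence, Tuple

Demo = Tuple[int, int]            # (state, expert_action) pair
Policy = Callable[[int], int]     # maps a state to an action


def mismatch_rate(policy: Policy, demos: Sequence[Demo]) -> float:
    """Fraction of held-out pairs where the policy disagrees with the expert."""
    return sum(policy(s) != a for s, a in demos) / len(demos)


def select_horizon_by_cv(
    demos: Sequence[Demo],
    candidates: Sequence[Hashable],
    fit_policy: Callable[[Sequence[Demo], Hashable], Policy],
    k: int = 5,
    seed: int = 0,
) -> Tuple[Hashable, Dict[Hashable, float]]:
    """Pick the candidate horizon (or discount factor) with the lowest
    average held-out mismatch over k folds.

    `fit_policy` is an assumed placeholder for an IRL solver followed by
    planning under the learned reward with the candidate horizon.
    """
    if k < 2 or len(demos) < k:
        raise ValueError("need k >= 2 and at least k demonstrations")

    shuffled = list(demos)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # every fold is non-empty

    scores: Dict[Hashable, float] = {}
    for h in candidates:
        fold_errors = []
        for i, held_out in enumerate(folds):
            # Train on all folds except the i-th, validate on the i-th.
            train = [d for j, fold in enumerate(folds) if j != i for d in fold]
            policy = fit_policy(train, h)
            fold_errors.append(mismatch_rate(policy, held_out))
        scores[h] = sum(fold_errors) / k

    best = min(candidates, key=lambda h: scores[h])
    return best, scores
```

Passing the solver in as a callable keeps the selection loop independent of the IRL algorithm, so the same loop can range over discount factors or over finite horizons.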

Paper Structure

This paper contains 44 sections, 9 theorems, 51 equations, 6 figures, 2 algorithms.

Key Result

Theorem 4.1

Let $(S, A, P)$ be a controlled Markov process shared by two MDPs: the ground-truth MDP $(S, A, P, R_0, \gamma_{0})$ with reward function $R_0: S \times A \rightarrow [0, R_{\max}]$ and discount factor $\gamma_{0} \in (0, 1)$; and the estimated MDP $(S, A, P, \widehat{R}, \widehat{\gamma})$ with rew

Figures (6)

  • Figure 1: Summary of LP-IRL with varying discount factors across four tasks. The error count is the number of states in which a policy's chosen action deviates from the expert's action. Each task displays the ground-truth value function (column 1), reward function (column 2), expert policy (column 3), error count curves for different amounts of expert data in a single instance (columns 4-8), and the error count curve summary for a batch of 10 MDPs across varying amounts of expert data (column 9). In all four tasks, $\gamma_{0} = 0.99$. The optimal discount factor satisfies $\widehat{\gamma}^* < \gamma_{0}$ across varying amounts of expert data. MaxEnt-IRL has similar curves in Figure 4.
  • Figure 2: Optimal $\widehat{\gamma}^*$ for LP-IRL at varying amounts of expert data. For each task, we select $\widehat{\gamma}^*$ for all 10 sampled environments through cross-validation. The orange curves illustrate how the optimal discount factor $\widehat{\gamma}^*$ changes with the amount of expert data, while the green curves show the corresponding error counts. The ground-truth $\gamma_{0} = 0.99$ is depicted in grey, with its error counts displayed in blue. As the amount of expert data increases, $\widehat{\gamma}^*$ initially decreases sharply and then gradually increases, indicating that overfitting is prominent when expert data is scarce.
  • Figure 3: The cross-validation results for LP-IRL on four tasks are shown. The $x$-axis represents the amount of expert data; the $y$-axis shows policy error count differences. We compare discount factors $\widehat{\gamma}^*_{\text{cv}}$ (learned from cross-validation) and $\widehat{\gamma}^*_{\text{oracle}}$ (chosen by the oracle). Orange dots depict error differences between policies induced by $\widehat{\gamma}^*_{\text{cv}}$ and $\widehat{\gamma}^*_{\text{oracle}}$; blue dots show differences between policies induced by $\widehat{\gamma}^*_{\text{cv}}$ and the ground-truth $\gamma_{0}$. The orange curves near zero indicate that cross-validation effectively selects $\widehat{\gamma}^*$, while the positive blue curves show that cross-validation consistently yields better policies than using $\gamma_{0}$.
  • Figure 4: Summary of MaxEnt-IRL with different horizons for the four tasks. For each task, we present the ground-truth value function (column 1), ground-truth reward function (column 2), expert policy (column 3), error count curves over all states for different amounts of expert data for a single instance of the task (columns 4-8), and finally, a summary of the error curves for a batch of 10 MDPs (column 9). In all four tasks, the ground-truth horizon is $T_{0} = 20$. The optimal horizon satisfies $\widehat{T}^* < T_{0}$ across varying amounts of expert data.
  • Figure 5: Optimal horizons ($\widehat{T}^*$) for MaxEnt-IRL at varying amounts of expert data. We select the optimal $\widehat{T}^*$ for each of 10 sampled task environments using the cross-validation algorithm (Section \ref{sec:cross-val}), based on the amount of expert data. Orange curves show how $\widehat{T}^*$ changes with the amount of expert data, while green curves display corresponding error counts. The ground-truth $T_{0} = 20$ is depicted by a grey line, with corresponding error lines in blue. The trends are consistent with LP-IRL.
  • ...and 1 more figure
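The error count used in the captions above (the number of states where a policy's action differs from the expert's) can be illustrated with a small tabular sketch. It assumes a known transition tensor `P` of shape `(S, A, S)` and a reward table `R` of shape `(S, A)`. The functions `greedy_policy` and `error_count` are hypothetical helpers written for this illustration, and plain value iteration stands in for whatever planner the paper uses.

```python
import numpy as np


def greedy_policy(P: np.ndarray, R: np.ndarray, gamma: float,
                  iters: int = 10_000, tol: float = 1e-8) -> np.ndarray:
    """Greedy policy from value iteration on a tabular MDP.

    P has shape (S, A, S) with P[s, a, s'] the transition probability;
    R has shape (S, A). Ties between actions go to the lowest index.
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)          # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V)
    return Q.argmax(axis=1)


def error_count(policy_actions, expert_actions) -> int:
    """Number of states where the policy's action differs from the expert's."""
    policy_actions = np.asarray(policy_actions)
    expert_actions = np.asarray(expert_actions)
    if policy_actions.shape != expert_actions.shape:
        raise ValueError("policies must cover the same set of states")
    return int(np.sum(policy_actions != expert_actions))
```

For example, one could compute an expert policy with `greedy_policy(P, R, gamma=0.99)`, compute a second policy under a smaller discount factor on an estimated reward, and compare the two with `error_count`.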

Theorems & Definitions (13)

  • Definition 3.1: Policy Class and Complexity Measure
  • Theorem 4.1
  • Definition 4.2: IRL Problem, adapted to the setting of varying discount factors
  • Lemma 4.3: Feasible Reward Function Set, extended from pmlr-v139-metelli21a
  • Theorem 4.4: Extension of Theorem 3.1 in pmlr-v139-metelli21a
  • Theorem 4.5
  • Theorem 4.6: Value Function Difference Bound
  • Definition A.1: Reward and Policy Equivalence
  • Lemma A.2: Potential-based Reward Shaping ng1999policy
  • Remark
  • ...and 3 more