Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning

Nathan Kallus, Masatoshi Uehara

TL;DR

Off-policy evaluation in long-horizon reinforcement learning suffers from a curse of horizon due to diminishing overlap between behavior and target policies. The paper derives semiparametric efficiency bounds for three problem models (NMDP, TMDP, and time-invariant MDP) and shows that truly off-policy evaluation is feasible under the time-invariant MDP structure, even with a single trajectory, unlike in the non-Markov and time-variant cases. It introduces an efficient, doubly robust estimator for infinite-horizon OPE in the MDP model, built on the efficient influence function and on joint estimates of the stationary density ratio w(s) and the q-function q(s,a); cross-fitting allows both nuisances to converge at slow, nonparametric rates. Experiments on Taxi and CartPole complement the theory, demonstrating that the DRL(𝓜_3) estimator achieves the efficiency bound and outperforms standard IS/DR/DM/MIS approaches, with valid confidence intervals.

Abstract

Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly-off-policy evaluation is feasible, even with only just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions and remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
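
To make the double robustness claim concrete, here is a minimal tabular sketch of an estimator of this form. It is an illustration under stated assumptions, not the authors' implementation: the function name `drl_estimate`, the array layout, the normalization of the policy value by (1 - gamma), and the two-state toy MDP are all hypothetical choices made for this example, and cross-fitting is omitted.

```python
import numpy as np

def drl_estimate(s, a, r, s_next, w, q, pi_e, pi_b, init_dist, gamma):
    """Doubly robust estimate of the normalized discounted policy value.

    s, a, r, s_next : arrays of observed transitions (length n)
    w               : [S] estimate of the stationary density ratio w(s)
    q               : [S, A] estimate of the q-function q(s, a)
    pi_e, pi_b      : [S, A] target and behavior policy probabilities
    init_dist       : [S] initial state distribution
    """
    v = (pi_e * q).sum(axis=1)                # v(s) = E_{a ~ pi_e}[q(s, a)]
    eta = pi_e[s, a] / pi_b[s, a]             # per-step policy ratio
    td = r + gamma * v[s_next] - q[s, a]      # temporal-difference residual
    direct = (1.0 - gamma) * init_dist @ v    # model-based (direct) term
    correction = np.mean(w[s] * eta * td)     # density-ratio-weighted correction
    return direct + correction

# Hypothetical toy MDP: 2 states, 2 actions, action a moves to state a,
# reward r(s, a) = s + a. The target policy always takes action 1.
gamma = 0.9
pi_e = np.array([[0.0, 1.0], [0.0, 1.0]])
pi_b = np.full((2, 2), 0.5)
init_dist = np.array([1.0, 0.0])

# One transition per (s, a) pair, matching a uniform behavior distribution.
s = np.array([0, 0, 1, 1])
a = np.array([0, 1, 0, 1])
r = (s + a).astype(float)
s_next = a.copy()

# Exact nuisances for this toy problem, worked out by hand.
q_true = np.array([[17.1, 19.0], [18.1, 20.0]])
w_true = np.array([0.2, 1.8])

# Correct q with a deliberately wrong w, then correct w with q set to zero.
est_exact_q = drl_estimate(s, a, r, s_next, np.array([5.0, -3.0]), q_true,
                           pi_e, pi_b, init_dist, gamma)
est_exact_w = drl_estimate(s, a, r, s_next, w_true, np.zeros((2, 2)),
                           pi_e, pi_b, init_dist, gamma)
```

In this toy problem the normalized value of the target policy is 1.9, and both calls recover it up to floating-point error: the estimate stays correct when only one of the two nuisances is correct, which is the consistency property the abstract describes.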

Paper Structure

This paper contains 44 sections, 27 theorems, 170 equations, 19 figures, 2 tables.

Key Result

Theorem 1

Figures (19)

  • Figure 1: NMDP
  • Figure 2: TMDP
  • Figure 3: MDP
  • Figure 4: Bayes net representation of the independence structure of the truncated trajectory ending with $s_2$, $\mathcal{J}_{s_2}$, under the three models: NMDP, TMDP, and MDP. Conditional on its parents, a node is independent of all other nodes. The congruency sign indicates that the conditional probability function given parent nodes is equal.
  • Figure 5: Arrangement of folds for cross-fitting of nuisances for DRL in $\mathcal{M}_3$.
  • ...and 14 more figures

Theorems & Definitions (48)

  • Definition 1: NMDP models $\mathcal{M}_1,\mathcal{M}_{1,b}$
  • Definition 2: TMDP models $\mathcal{M}_2,\mathcal{M}_{2,b}$
  • Theorem 1: EB under NMDP
  • Theorem 2: EB under TMDP
  • Remark 1
  • Corollary 1: Sufficient conditions for existence of efficiency bounds
  • Remark 2: The curse of horizon in $\mathcal{M}_1$, extended
  • Remark 3: The curse of horizon in $\mathcal{M}_2$, a milder version of the original
  • Definition 3: MDP models $\mathcal{M}_3,\mathcal{M}_{3,b}$
  • Theorem 3: EB under MDP
  • ...and 38 more