Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs

Yuheng Zhang, Nan Jiang

TL;DR

This paper studies off-policy evaluation (OPE) of history-dependent policies in POMDPs with large observation spaces. It proves information-theoretic hardness for model-free approaches and shows that a simple model-based method achieves polynomial sample complexity under suitable coverage conditions. Building on the belief and outcome coverage assumptions identified in prior work, it derives concrete bounds for a maximum-likelihood model-based estimator, establishing a formal separation between model-free and model-based OPE in POMDPs. The results cover both single-step and multi-step outcome revealing, and extend to state-space misspecification via observable-equivalent realizability. For offline evaluation of history-dependent policies, the findings favor model-based strategies.

Abstract

We investigate off-policy evaluation (OPE), a central and fundamental problem in reinforcement learning (RL), in the challenging setting of Partially Observable Markov Decision Processes (POMDPs) with large observation spaces. Recent works of Uehara et al. (2023a) and Zhang & Jiang (2024) developed a model-free framework and identified important coverage assumptions (called belief and outcome coverage) that enable accurate OPE of memoryless policies with polynomial sample complexities, but handling more general target policies that depend on the entire observable history remained an open problem. In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm -- whose analysis has surprisingly eluded the literature despite the algorithm's simplicity -- demonstrating provable separation between model-free and model-based OPE in POMDPs.
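To make the model-based idea concrete, the following is a minimal sketch of the evaluation half of such a pipeline: given an already-fitted tabular POMDP model, it computes the exact value of a history-dependent policy by enumerating observable histories while tracking an unnormalized belief over latent states. It is an illustration under stated assumptions, not the paper's algorithm. The function name `evaluate_in_model`, the array layout, and the choice of a reward that depends on the latent state and action are all hypothetical, and the model-fitting step (for example, maximum likelihood over a model class) is omitted entirely.

```python
from typing import Callable, List, Sequence, Tuple

# A history is the tuple (o_1, a_1, o_2, a_2, ..., o_h): hypothetical encoding.
History = Tuple[int, ...]
Policy = Callable[[History], Sequence[float]]  # history -> action probabilities


def evaluate_in_model(
    init: Sequence[float],                        # init[s]: initial state probability
    trans: Sequence[Sequence[Sequence[float]]],   # trans[s][a][s2]: transition probability
    emit: Sequence[Sequence[float]],              # emit[s][o]: observation probability
    reward: Sequence[Sequence[float]],            # reward[s][a]: assumed reward form
    policy: Policy,
    horizon: int,
) -> float:
    """Exact expected return of a history-dependent policy in a tabular POMDP model.

    Enumerates all observable histories, so the cost grows exponentially
    with the horizon; this is only meant for tiny illustrative models.
    """
    n_states = len(init)
    n_obs = len(emit[0])

    def recurse(alpha: List[float], history: History, step: int) -> float:
        # alpha[s] = P(history so far, current state = s) under the policy.
        total = 0.0
        for o in range(n_obs):
            w = [alpha[s] * emit[s][o] for s in range(n_states)]
            if sum(w) == 0.0:
                continue  # this observation cannot occur after this history
            hist_o = history + (o,)
            for a, p in enumerate(policy(hist_o)):
                if p == 0.0:
                    continue
                # Expected immediate reward, weighted by the joint probability.
                total += p * sum(w[s] * reward[s][a] for s in range(n_states))
                if step < horizon:
                    nxt = [
                        p * sum(w[s] * trans[s][a][s2] for s in range(n_states))
                        for s2 in range(n_states)
                    ]
                    total += recurse(nxt, hist_o + (a,), step + 1)
        return total

    return recurse(list(init), (), 1)


# Toy model: two states that flip every step, observations reveal the state,
# and the reward is 1 when the action matches the current state.
INIT = [0.5, 0.5]
TRANS = [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
EMIT = [[1.0, 0.0], [0.0, 1.0]]
REWARD = [[1.0, 0.0], [0.0, 1.0]]


def one_hot(a: int) -> List[float]:
    return [1.0 if i == a else 0.0 for i in range(2)]


def memoryless(h: History) -> List[float]:
    return one_hot(h[-1])  # act on the latest observation only


def repeat_first(h: History) -> List[float]:
    return one_hot(h[0])   # history-dependent: always replay the first observation


value_memoryless = evaluate_in_model(INIT, TRANS, EMIT, REWARD, memoryless, 3)
value_repeat_first = evaluate_in_model(INIT, TRANS, EMIT, REWARD, repeat_first, 3)
```

In this toy model the memoryless policy matches the state at every step, while the history-dependent policy mismatches at the second step, so the two values differ even though both policies see the same observations. The enumeration is only a stand-in for whatever planning or simulation routine one would use inside a learned model.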

Paper Structure

This paper contains 38 sections, 10 theorems, 90 equations, 1 table.

Key Result

Theorem 1

Under Assumptions assum:belief and assum:multi, and assuming that $\pi_b$ and $\pi_e$ are memoryless, there exists an algorithm (see Appendix app:fdvf) such that, with probability at least $1-\delta$, … Here we assume that the algorithm has knowledge of the values of $C_{\mathcal{F}}$ and $C_{\mathcal{H}}$. T

Theorems & Definitions (21)

  • Theorem 1: Corollary of Theorem 7 of zhang2024curses
  • Definition 2: Model-free algorithm
  • Theorem 3: Information-theoretic hardness of model-free algorithms
  • proof
  • Theorem 4
  • Theorem 5
  • Definition 6
  • Theorem 7
  • proof
  • Theorem 8
  • ...and 11 more