Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs
Yuheng Zhang, Nan Jiang
TL;DR
This paper tackles off-policy evaluation (OPE) for history-dependent policies in POMDPs with large observation spaces, revealing a fundamental hardness for model-free approaches and proposing a simple model-based method that achieves polynomial sample complexity under suitable coverage conditions. It builds on belief and outcome coverage, the coverage assumptions identified in prior work on memoryless policies, and derives concrete bounds for a maximum-likelihood model-based estimator, showing a formal separation between model-free and model-based OPE in POMDPs. The results cover both single-step and multi-step revealing settings, and extend to scenarios with state-space misspecification via observable-equivalent realizability. The work advances the understanding of OPE in non-Markov environments and suggests favoring model-based strategies for offline evaluation when history-dependent policies are involved.
Abstract
We investigate off-policy evaluation (OPE), a central problem in reinforcement learning (RL), in the challenging setting of Partially Observable Markov Decision Processes (POMDPs) with large observation spaces. Recent works by Uehara et al. (2023a) and Zhang & Jiang (2024) developed a model-free framework and identified important coverage assumptions (called belief and outcome coverage) that enable accurate OPE of memoryless policies with polynomial sample complexity, but handling more general target policies that depend on the entire observable history remained an open problem. In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some of this hardness can be circumvented by a natural model-based algorithm (whose analysis has surprisingly eluded the literature despite the algorithm's simplicity), demonstrating a provable separation between model-free and model-based OPE in POMDPs.
