Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation
Hongyi Zhou, Josiah P. Hanna, Jin Zhu, Ying Yang, Chengchun Shi
TL;DR
This paper tackles off-policy evaluation (OPE) when the behavior policy is estimated from data, addressing the paradox that estimating a history-dependent behavior policy can reduce the MSE of importance sampling (IS) estimators even when the true behavior policy is Markov. It develops a bias-variance decomposition for ordinary IS (OIS) and extends the analysis to sequential IS (SIS), doubly robust (DR), and marginalized IS (MIS) estimators, showing that conditioning on a longer history can reduce asymptotic variance but increases finite-sample bias. The authors also extend these results to nonparametric (sieve) estimators and illustrate the findings through numerical experiments in CartPole and MuJoCo environments, which show a consistent large-sample gain from history while warning about finite-sample bias and MIS pitfalls. Finally, the paper offers practical guidelines for choosing the history length via a variance-based criterion and discusses implications for OPE practice and future extensions to broader settings.
Abstract
This paper studies off-policy evaluation (OPE) in reinforcement learning with a focus on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to lower mean squared error (MSE) even when the true behavior policy is Markovian. However, the question of why the use of history should lower MSE remains open. In this paper, we theoretically demystify this paradox by deriving a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, demonstrating that history-dependent behavior policy estimation decreases their asymptotic variances while increasing their finite-sample biases. Additionally, as the estimated behavior policy conditions on a longer history, we show a consistent decrease in variance. We extend these findings to a range of other OPE estimators, including the sequential IS estimator, the doubly robust estimator and the marginalized IS estimator, with the behavior policy estimated either parametrically or non-parametrically.
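The paradox can be seen in a toy setting. The sketch below is not taken from the paper's experiments. It assumes a two-step problem with binary actions, no state, a uniform behavior policy that ignores history, a target policy that picks action 1 with probability 0.8, and a deterministic return equal to the number of times action 1 is taken. All function names are hypothetical. It compares OIS under three sets of behavior probabilities: the true ones, a memoryless fit (per-step action frequencies), and a history-dependent fit (the second-step frequencies conditioned on the first action).

```python
import numpy as np

PI_E = 0.8  # assumed target policy: P(action = 1) at every step


def simulate(n, rng):
    # Two-step trajectories; the true behavior policy is uniform and ignores history.
    return rng.integers(0, 2, size=(n, 2))


def ois(actions, behavior_prob):
    # Ordinary IS: average of (trajectory-level weight) * (return).
    target_prob = np.where(actions == 1, PI_E, 1.0 - PI_E).prod(axis=1)
    returns = actions.sum(axis=1)  # toy return: number of times action 1 was taken
    return float(np.mean(target_prob / behavior_prob * returns))


def prob_true(actions):
    # Known behavior policy: each of the 4 action sequences has probability 0.25.
    return np.full(len(actions), 0.25)


def prob_memoryless(actions):
    # Estimated per-step action frequencies, with no conditioning on the past.
    p1, p2 = actions[:, 0].mean(), actions[:, 1].mean()
    q1 = np.where(actions[:, 0] == 1, p1, 1.0 - p1)
    q2 = np.where(actions[:, 1] == 1, p2, 1.0 - p2)
    return q1 * q2


def prob_history(actions):
    # b_hat(a1) * b_hat(a2 | a1) with count-based estimates, which equals
    # the empirical frequency of the whole action sequence (a1, a2).
    codes = actions[:, 0] * 2 + actions[:, 1]
    counts = np.bincount(codes, minlength=4)
    return counts[codes] / len(actions)


rng = np.random.default_rng(0)
estimates = {"true": [], "memoryless": [], "history": []}
for _ in range(500):
    a = simulate(200, rng)
    estimates["true"].append(ois(a, prob_true(a)))
    estimates["memoryless"].append(ois(a, prob_memoryless(a)))
    estimates["history"].append(ois(a, prob_history(a)))

variances = {name: float(np.var(vals)) for name, vals in estimates.items()}
print(variances)
```

In this toy case the history-dependent weights cancel the sampling noise in the action sequences: whenever all four sequences appear in the data, the estimate equals the true value 0.2·0.8·1 + 0.8·0.2·1 + 0.8·0.8·2 = 1.6, so its variance across replications is essentially zero. The finite-sample bias discussed in the abstract shows up when some sequences are never observed, which becomes more likely as the horizon or the conditioning history grows relative to the sample size.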
