Low Variance Off-policy Evaluation with State-based Importance Sampling
David M. Bossens, Philip S. Thomas
TL;DR
The paper tackles the high variance of importance-sampling estimators for off-policy evaluation (OPE) in reinforcement learning, a variance that grows exponentially with horizon length. It introduces state-based importance sampling (SIS), a general technique that drops selected states from the importance-weight product to reduce variance while controlling bias. Two methods for identifying negligible states are proposed, covariance testing and Q-value based identification, and SIS is instantiated across multiple OPE estimators (ordinary IS, WIS, PDIS, INCRIS, DR, and SDRE/MIS). Empirical results across four domains (lift variants, inventory management, and taxi) show consistent variance reductions and improved accuracy, with notable gains when negligible states are correctly identified. The work provides a practical, plug-in approach to stabilize OPE and enable safer policy evaluation and improvement in complex domains.
Abstract
In many domains, the exploration process of reinforcement learning will be too costly as it requires trying out suboptimal policies, resulting in a need for off-policy evaluation, in which a target policy is evaluated based on data collected from a known behaviour policy. In this context, importance sampling estimators provide estimates for the expected return by weighting the trajectory based on the probability ratio of the target policy and the behaviour policy. Unfortunately, such estimators have a high variance and therefore a large mean squared error. This paper proposes state-based importance sampling estimators which reduce the variance by dropping certain states from the computation of the importance weight. To illustrate their applicability, we demonstrate state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation. Experiments in four domains show that state-based methods consistently yield reduced variance and improved accuracy compared to their traditional counterparts.
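To make the idea of dropping states from the importance weight concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: policies are assumed to be tabular dictionaries of the form `pi[state][action] = probability`, trajectories are lists of `(state, action, reward)` tuples, and the set of dropped states is supplied by hand rather than found by the covariance-testing or Q-value-based methods. The names `importance_weight`, `is_estimate`, and `dropped_states` are hypothetical.

```python
from math import prod


def importance_weight(trajectory, pi_e, pi_b, dropped_states=frozenset()):
    """Product of pi_e/pi_b action-probability ratios along one trajectory.

    Steps whose state is in `dropped_states` are skipped, which is the
    state-based variant; an empty set gives the ordinary importance weight.
    """
    return prod(
        pi_e[s][a] / pi_b[s][a]
        for s, a, _ in trajectory
        if s not in dropped_states
    )


def is_estimate(trajectories, pi_e, pi_b, gamma=1.0, dropped_states=frozenset()):
    """Average of (importance weight * discounted return) over trajectories."""
    total = 0.0
    for traj in trajectories:
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        total += importance_weight(traj, pi_e, pi_b, dropped_states) * ret
    return total / len(trajectories)


# Toy data (assumed for illustration): the reward depends only on the
# action taken in state "A"; the action taken in state "B" is irrelevant.
pi_b = {"A": {0: 0.5, 1: 0.5}, "B": {0: 0.5, 1: 0.5}}  # behaviour policy
pi_e = {"A": {0: 0.9, 1: 0.1}, "B": {0: 0.8, 1: 0.2}}  # target policy
trajectories = [
    [("A", 0, 1.0), ("B", 0, 0.0)],
    [("A", 1, 0.0), ("B", 1, 0.0)],
]

ordinary = is_estimate(trajectories, pi_e, pi_b)
state_based = is_estimate(trajectories, pi_e, pi_b, dropped_states={"B"})
print("ordinary IS:", ordinary)
print("state-based IS (dropping B):", state_based)
```

In this toy setup the ratio contributed by state "B" only adds noise to the weight, so excluding it leaves a weight that depends solely on the state where the action matters. Deciding which states can be dropped without introducing unacceptable bias is the part the paper's identification methods address, and it is not covered by this sketch.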
