Low Variance Off-policy Evaluation with State-based Importance Sampling

David M. Bossens, Philip S. Thomas

TL;DR

The paper tackles the problem of high variance in off-policy evaluation (OPE) in reinforcement learning, a variance that grows exponentially with the horizon length. It introduces state-based importance sampling (SIS), a general technique that drops selected states from the importance-weight product to reduce variance while controlling bias. Two methods for detecting negligible states are proposed, covariance testing and Q-value based identification, and SIS is instantiated across multiple OPE estimators (ordinary IS, WIS, PDIS, INCRIS, DR, and SDRE/MIS). Empirical results across four domains (deterministic lift, stochastic lift, inventory management, and taxi) show consistent variance reductions and improved accuracy, with notable gains when negligible states are correctly identified. The work provides a practical, plug-in approach to stabilize OPE and enable safer policy evaluation and improvement in complex domains.

Abstract

In many domains, the exploration process of reinforcement learning will be too costly as it requires trying out suboptimal policies, resulting in a need for off-policy evaluation, in which a target policy is evaluated based on data collected from a known behaviour policy. In this context, importance sampling estimators provide estimates for the expected return by weighting the trajectory based on the probability ratio of the target policy and the behaviour policy. Unfortunately, such estimators have a high variance and therefore a large mean squared error. This paper proposes state-based importance sampling estimators which reduce the variance by dropping certain states from the computation of the importance weight. To illustrate their applicability, we demonstrate state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation. Experiments in four domains show that state-based methods consistently yield reduced variance and improved accuracy compared to their traditional counterparts.
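
To make the idea in the abstract concrete, the following is a minimal sketch of ordinary importance sampling and a state-based variant that skips a given set of states when forming the importance weight. It assumes tabular policies stored as arrays indexed by `[state, action]` and trajectories given as `(states, actions, rewards)` tuples; the function names, the `dropped_states` argument, and the data layout are illustrative assumptions, not the authors' implementation. How the dropped set is chosen (for example, by covariance testing or Q-value based identification) is outside the scope of this sketch.

```python
import numpy as np

def importance_weight(states, actions, pi_e, pi_b, dropped_states=frozenset()):
    """Product of pi_e/pi_b over time steps whose state is not in dropped_states."""
    weight = 1.0
    for s, a in zip(states, actions):
        if s in dropped_states:
            continue  # state-based IS: this state's ratio is left out of the product
        weight *= pi_e[s, a] / pi_b[s, a]
    return weight

def is_estimate(trajectories, pi_e, pi_b, gamma=1.0, dropped_states=frozenset()):
    """Average of (importance weight x discounted return) over trajectories.

    With an empty dropped_states this is ordinary importance sampling;
    with a non-empty set it is the state-based variant (hypothetical sketch).
    """
    weighted_returns = []
    for states, actions, rewards in trajectories:
        ret = sum(gamma ** t * r for t, r in enumerate(rewards))
        w = importance_weight(states, actions, pi_e, pi_b, dropped_states)
        weighted_returns.append(w * ret)
    return float(np.mean(weighted_returns))

# Toy data (assumed for illustration): 2 states, 2 actions, uniform behaviour policy.
pi_b = np.full((2, 2), 0.5)
pi_e = np.array([[0.9, 0.1], [0.9, 0.1]])
trajectories = [([0, 1], [0, 1], [1.0, 0.0])]

ordinary = is_estimate(trajectories, pi_e, pi_b)
state_based = is_estimate(trajectories, pi_e, pi_b, dropped_states={1})
```

In this toy case the ordinary weight multiplies the ratios from both visited states, while the state-based weight keeps only the ratio from state 0. Dropping a state removes one factor from the product, which is the mechanism the abstract credits for the variance reduction; whether the resulting bias is acceptable depends on the dropped states actually being negligible.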

Paper Structure

This paper contains 17 sections, 21 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Residuals ($y$-axis), defined as $\hat{G} - \mathcal{G}$, as a function of the domain size ($x$-axis) in the lift domains. Estimates are based on 1,000 episodes. Residuals are represented by their mean $\pm$ standard error over 50 independent runs for the deterministic lift domain and over 200 independent runs for the stochastic lift domain. SDRE and SSDRE are omitted to keep the plot readable, as their residuals are extremely large.
  • Figure 2: Residuals ($y$-axis), defined as $\hat{G} - \mathcal{G}$, as a function of the number of episodes ($x$-axis) in the inventory management domain. Residuals are represented by their mean $\pm$ standard error over 50 independent runs.
  • Figure 3: Residuals ($y$-axis), defined as $\hat{G} - \mathcal{G}$, as a function of the effective horizon ($H$; $x$-axis) of the taxi domain. Residuals are represented by their mean $\pm$ standard error over 20 independent runs.
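
The quantity plotted in all three figures, the residual $\hat{G} - \mathcal{G}$ summarised as mean $\pm$ standard error over independent runs, can be computed as in the sketch below. The function name `residual_summary` and its arguments are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the captions do not state which convention is used.

```python
import numpy as np

def residual_summary(estimates, true_value):
    """Return (mean, standard error) of the residuals estimate - true_value.

    estimates: one OPE estimate per independent run.
    true_value: the true expected return of the target policy.
    """
    residuals = np.asarray(estimates, dtype=float) - true_value
    mean = residuals.mean()
    # Standard error of the mean, using the sample standard deviation (assumed).
    stderr = residuals.std(ddof=1) / np.sqrt(residuals.size)
    return float(mean), float(stderr)
```

A mean residual near zero indicates low bias, while a small standard error indicates that the estimator varies little from run to run.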

Theorems & Definitions (2)

  • Definition 1
  • Definition 2