Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation

Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, Qiang Liu

TL;DR

This work tackles infinite-horizon off-policy evaluation, where trajectory-based importance sampling suffers from high variance and density-ratio methods can incur bias. It introduces a doubly robust estimator that combines a density-ratio–based estimator with a value-function–based estimator through a bridge term, guaranteeing accuracy if either component is correct. The authors establish a bias-robustness property, provide a variance analysis, and connect the estimator to a primal-dual Lagrangian formulation, offering a unified view of density and value learning. Empirical results on Taxi, Puck-Mountain, and InvertedPendulum demonstrate substantially reduced bias and competitive performance compared to prior infinite-horizon methods. Overall, the paper contributes a theoretically grounded, practically effective approach for reliable off-policy evaluation in long-horizon settings and opens avenues for joint primal-dual optimization of density and value functions.

Abstract

Infinite horizon off-policy policy evaluation is a highly challenging task due to the excessively large variance of typical importance sampling (IS) estimators. Recently, Liu et al. (2018a) proposed an approach that significantly reduces the variance of infinite-horizon off-policy evaluation by estimating the stationary density ratio, but at the cost of introducing potentially high biases due to the error in density ratio estimation. In this paper, we develop a bias-reduced augmentation of their method, which can take advantage of a learned value function to obtain higher accuracy. Our method is doubly robust in that the bias vanishes when either the density ratio or the value function estimation is perfect. In general, when either of them is accurate, the bias can also be reduced. Both theoretical and empirical results show that our method yields significant advantages over previous methods.
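The double robustness claim can be checked numerically on a small tabular problem. The sketch below is illustrative only and is not taken from the paper's code. It assumes the estimator has the common form $(1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0}[\widehat{V}(s_0)] + \mathbb{E}[\widehat{w}(s)\,\beta(a|s)\,(r + \gamma \widehat{V}(s') - \widehat{V}(s))]$ with $\beta = \pi/\pi_0$, and it assumes the reward is normalized by $(1-\gamma)$. The names `dr_value`, `d_data`, `V_bad`, and `w_bad` are hypothetical, and the random MDP is made up for the example. Expectations are computed exactly from the transition matrix rather than from samples, so the check isolates bias from variance.

```python
import numpy as np

def dr_value(mu0, d_data, pi, pi0, P, r, gamma, V_hat, w_hat):
    """Population-level doubly robust value for a tabular MDP (illustrative sketch)."""
    beta = pi / pi0                                   # policy ratio, shape (S, A)
    # One-step residual r(s,a) + gamma * E[V(s') | s,a] - V(s), shape (S, A)
    td = r + gamma * (P @ V_hat) - V_hat[:, None]
    value_term = (1.0 - gamma) * (mu0 @ V_hat)        # value-function-based part
    # Density-ratio-weighted residual under data drawn as s ~ d_data, a ~ pi0
    correction = np.sum(d_data[:, None] * pi0 * w_hat[:, None] * beta * td)
    return value_term + correction

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))            # transitions, shape (S, A, S)
r = rng.uniform(size=(S, A))                          # rewards
pi = rng.dirichlet(np.ones(A), size=S)                # target policy
pi0 = rng.dirichlet(np.ones(A), size=S)               # behavior policy
mu0 = np.full(S, 1.0 / S)                             # initial state distribution
d_data = rng.dirichlet(np.ones(S))                    # state distribution of the data

# Exact quantities under the target policy
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = (pi * r).sum(axis=1)
V_true = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
d_pi = (1.0 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
w_true = d_pi / d_data
R_true = (1.0 - gamma) * (mu0 @ V_true)

# Deliberately wrong estimates of the value function and the density ratio
V_bad = V_true + rng.normal(size=S)
w_bad = w_true * rng.uniform(0.5, 1.5, size=S)

args = (mu0, d_data, pi, pi0, P, r, gamma)
est_V_right = dr_value(*args, V_true, w_bad)          # correct V, wrong w
est_w_right = dr_value(*args, V_bad, w_true)          # wrong V, correct w
est_both_wrong = dr_value(*args, V_bad, w_bad)        # both wrong

print("true value      :", R_true)
print("V exact, w wrong:", est_V_right)
print("w exact, V wrong:", est_w_right)
print("both wrong      :", est_both_wrong)
```

Under these assumptions the first two estimates match the true value up to floating-point error, while the last one generally does not. This mirrors the abstract's statement that the bias vanishes when either the density ratio or the value function is exact.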

Paper Structure

This paper contains 39 sections, 6 theorems, 52 equations, 3 figures, 3 algorithms.

Key Result

Theorem 3.1

Let $R^\pi_{\text{DR}}[\widehat{V}, \widehat{w}] := \lim_{n_0,n,T\to \infty}\widehat{R}^\pi_{\text{DR}}[\widehat{V}, \widehat{w}]$ be the limit of $\widehat{R}^\pi_{\text{DR}}$ as the number of samples goes to infinity. Following the definition above, we have where $\varepsilon_{\widehat{V}}$ and $\varepsilon_{\widehat{w}}$ are errors of $\widehat{V}$ and $\widehat{w}$, respectively, defined as follows The err

Figures (3)

  • Figure 1: Off-policy evaluation results on Taxi. Default parameters: discount factor $\gamma = 0.99$, mixed ratio $\alpha = \beta = 1$, horizon length $H = 600$. For (a)-(c), the x-axis is the number of trajectories and the y-axis shows MSE, squared bias, and variance, respectively. For (d), we fix the total number of samples (number of trajectories times horizon length), vary the horizon length along the x-axis, and report the MSE. (e) and (f) show how the bias changes as the mixed ratios $\alpha$ and $\beta$ vary. We repeat each experiment for 1000 runs.
  • Figure 2: Off-policy evaluation results on Puck-Mountain. We set the discount factor $\gamma = 0.995$ as default. For (a)-(c), we set the horizon $H = 1000$ and the x-axis is the number of trajectories used for evaluation. For (d), we fix the total number of samples and vary the horizon length.
  • Figure 3: Off-policy evaluation results on InvertedPendulum-v2. We set the discount factor $\gamma = 0.995$ as default. For (a)-(c), we set the horizon $H = 1000$ and the x-axis is the number of trajectories used for evaluation. For (d), we fix the total number of samples and vary the horizon length.

Theorems & Definitions (13)

  • Theorem 3.1: Double Robustness
  • Theorem 3.2: Variance Analysis
  • Theorem 4.1
  • Definition A.1
  • Lemma A.2
  • proof
  • proof
  • Theorem A.3
  • proof
  • proof
  • ...and 3 more