Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation
Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, Qiang Liu
TL;DR
This work tackles infinite-horizon off-policy evaluation, where trajectory-based importance sampling suffers from high variance and density-ratio methods can incur bias. It introduces a doubly robust estimator that combines a density-ratio-based estimator with a value-function-based estimator through a bridge term, so that the bias vanishes if either component is exact. The authors establish this double robustness property, provide a variance analysis, and connect the estimator to a primal-dual Lagrangian formulation, offering a unified view of density and value learning. Empirical results on Taxi, Puck-Mountain, and InvertedPendulum show substantially reduced bias and competitive performance compared to prior infinite-horizon methods. Overall, the paper contributes a theoretically grounded approach to off-policy evaluation in long-horizon settings and opens avenues for joint primal-dual optimization of density and value functions.
Abstract
Infinite-horizon off-policy policy evaluation is a highly challenging task due to the excessively large variance of typical importance sampling (IS) estimators. Recently, Liu et al. (2018a) proposed an approach that significantly reduces the variance of infinite-horizon off-policy evaluation by estimating the stationary density ratio, but at the cost of introducing potentially high bias due to errors in density ratio estimation. In this paper, we develop a bias-reduced augmentation of their method, which can take advantage of a learned value function to obtain higher accuracy. Our method is doubly robust in that the bias vanishes when either the density ratio or the value function estimate is perfect. More generally, the bias is reduced whenever either of them is accurate. Both theoretical and empirical results show that our method yields significant advantages over previous methods.
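
The double robustness claim can be checked numerically on a toy problem. The sketch below is an illustration, not the authors' implementation. It assumes a small random tabular MDP, a normalized discounted reward (scaled by 1 - gamma), behavior data whose states follow the discounted occupancy of the behavior policy, and exact expectations in place of sampled transitions. The estimator is written as a density-ratio term plus a value term minus a bridge term. All names (`discounted_occupancy`, `dr_estimate`, `w_bad`, `v_bad`, `mdp`) are hypothetical.

```python
import numpy as np

def discounted_occupancy(P_pol, mu0, gamma):
    """Normalized discounted state occupancy: (1 - gamma) * sum_t gamma^t d_t."""
    n = len(mu0)
    return (1 - gamma) * np.linalg.solve((np.eye(n) - gamma * P_pol).T, mu0)

def dr_estimate(w_hat, v_hat, *, d0, pi, pi0, P, r, mu0, gamma):
    """Doubly robust estimate using exact expectations over (s, a, s')."""
    beta = (pi / pi0)[:, :, None]                    # policy ratio pi(a|s) / pi0(a|s)
    w = w_hat[:, None, None]                         # state density ratio estimate
    joint = d0[:, None, None] * pi0[:, :, None] * P  # behavior law of (s, a, s')
    r_sis = np.sum(joint * w * beta * r[:, :, None])           # density-ratio term
    r_val = (1 - gamma) * (mu0 @ v_hat)                        # value-function term
    bridge = np.sum(
        joint * w * (v_hat[:, None, None] - gamma * beta * v_hat[None, None, :])
    )                                                          # bridge term
    return r_sis + r_val - bridge

# Toy MDP (assumed for illustration only).
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)     # target policy
pi0 = rng.random((S, A)); pi0 /= pi0.sum(axis=1, keepdims=True)  # behavior policy
mu0 = np.array([0.5, 0.3, 0.2])

# Ground truth quantities for the target policy.
P_pi = np.einsum("sa,sat->st", pi, P)
P_pi0 = np.einsum("sa,sat->st", pi0, P)
d_pi = discounted_occupancy(P_pi, mu0, gamma)
d0 = discounted_occupancy(P_pi0, mu0, gamma)
w_true = d_pi / d0
v_true = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(axis=1))
true_value = (1 - gamma) * (mu0 @ v_true)

mdp = dict(d0=d0, pi=pi, pi0=pi0, P=P, r=r, mu0=mu0, gamma=gamma)
w_bad = w_true * (1 + 0.5 * rng.standard_normal(S))  # deliberately wrong ratio
v_bad = v_true + rng.standard_normal(S)              # deliberately wrong value

print("true value          :", true_value)
print("exact w, wrong V    :", dr_estimate(w_true, v_bad, **mdp))
print("wrong w, exact V    :", dr_estimate(w_bad, v_true, **mdp))
print("wrong w, wrong V    :", dr_estimate(w_bad, v_bad, **mdp))
```

In this setting the first two estimates coincide with the true value up to floating-point error, while the last generally does not. In practice both the density ratio and the value function are learned from samples, so the expectations above would be replaced by empirical averages over observed transitions.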
