$Δ\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies
Olivier Jeunen, Aleksei Ustimenko
TL;DR
The paper introduces Δ-OPE, a pairwise off-policy evaluation framework that estimates the improvement $V(\pi_t) - V(\pi_p)$ of a target policy $\pi_t$ over a production policy $\pi_p$ using data from a logging policy $\pi_0$, achieving substantial variance reduction when the production and target policies covary positively. It develops three estimators, $\Delta\text{-IPS}$, $\Delta\text{-SNIPS}$, and $\Delta\beta\text{-IPS}$, with a closed-form, variance-minimising additive control variate for the latter, and establishes finite-sample unbiasedness under standard IPS assumptions. Through simulations and large-scale online experiments, the authors demonstrate improved estimation accuracy, tighter confidence intervals, and enhanced learning performance, with $\Delta\beta\text{-IPS}$ often delivering the strongest gains. The work offers a practical, scalable path for efficient evaluation and policy improvement in recommender systems, and outlines avenues for extending Δ-OPE to doubly robust and ranking contexts.
Abstract
The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to unbiasedly estimate online metrics using offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that comes with unbiasedness is typically the crux that complicates practical applications. An important insight is that the difference between policy values can often be estimated with significantly reduced variance, if said policies have positive covariance. This allows us to formulate a pairwise off-policy estimation task: $Δ\text{-}{\rm OPE}$. $Δ\text{-}{\rm OPE}$ subsumes the common use-case of estimating improvements of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce $Δ\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring estimator and its extensions. Moreover, we characterise a variance-optimal additive control variate that further enhances efficiency. Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
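To make the pairwise idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a context-free bandit with per-sample probabilities `pi_t`, `pi_p`, `pi_0` of the logged action under the target, production and logging policies, and logged rewards `r`. The function names, the toy reward probabilities, and the `simulate` helper are hypothetical. The control variate `beta` is obtained here by minimising the variance of the weighted term $w\,(r - \beta)$ with $w = (\pi_t - \pi_p)/\pi_0$, which gives $\beta = \mathbb{E}[w^2 r] / \mathbb{E}[w^2]$; the exact form in the paper may differ, and plugging in a `beta` estimated from the same sample is an additional simplification.

```python
import numpy as np


def delta_ips(pi_t, pi_p, pi_0, r):
    """Pairwise IPS sketch: mean of (pi_t - pi_p) / pi_0 * r over logged samples."""
    w = (pi_t - pi_p) / pi_0
    return float(np.mean(w * r))


def delta_beta_ips(pi_t, pi_p, pi_0, r):
    """Pairwise IPS with an additive control variate beta (assumed form).

    Since the weights w have zero mean under the logging policy, subtracting
    a constant beta from the reward leaves the expectation unchanged.
    beta is set to the sample analogue of E[w^2 r] / E[w^2].
    """
    w = (pi_t - pi_p) / pi_0
    denom = np.sum(w ** 2)
    # If both policies agree on every logged action, the estimate is zero anyway.
    beta = float(np.sum(w ** 2 * r) / denom) if denom > 0 else 0.0
    return float(np.mean(w * (r - beta))), beta


def simulate(n=10_000, seed=0):
    """Toy simulation with three actions and made-up reward probabilities."""
    rng = np.random.default_rng(seed)
    q = np.array([0.10, 0.12, 0.15])    # hypothetical click probabilities
    p0 = np.array([1 / 3, 1 / 3, 1 / 3])  # uniform logging policy
    pp = np.array([0.5, 0.3, 0.2])      # production policy
    pt = np.array([0.3, 0.3, 0.4])      # target policy
    a = rng.choice(3, size=n, p=p0)     # logged actions
    r = rng.binomial(1, q[a]).astype(float)  # logged Bernoulli rewards
    truth = float(np.dot(pt - pp, q))   # true V(pi_t) - V(pi_p)
    est = delta_ips(pt[a], pp[a], p0[a], r)
    est_beta, beta = delta_beta_ips(pt[a], pp[a], p0[a], r)
    return truth, est, est_beta, beta


if __name__ == "__main__":
    truth, est, est_beta, beta = simulate()
    print(f"true difference: {truth:.4f}")
    print(f"delta-IPS:       {est:.4f}")
    print(f"delta-beta-IPS:  {est_beta:.4f} (beta = {beta:.4f})")
```

In this sketch, a constant reward makes the control-variate estimate exactly zero, which matches the true difference in that case; it is a quick sanity check on why subtracting `beta` can remove variance without shifting the expectation.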
