$Δ\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies

Olivier Jeunen; Aleksei Ustimenko

TL;DR

The paper introduces Δ-OPE, a pairwise off-policy evaluation framework that estimates the improvement $V(\pi_t) - V(\pi_p)$ of a target policy $\pi_t$ over a production policy $\pi_p$ using data from a logging policy $\pi_0$, achieving substantial variance reduction when the production and target policies covary positively. It develops three estimators, $\Delta\text{-}{\rm IPS}$, $\Delta\text{-}{\rm SNIPS}$, and $\Delta\beta\text{-}{\rm IPS}$, with a closed-form, variance-minimising additive control variate for the last of these, and provides finite-sample unbiasedness under standard IPS assumptions. Through simulations and large-scale online experiments, the authors demonstrate improved estimation accuracy, tighter confidence intervals, and enhanced learning performance, with $\Delta\beta\text{-}{\rm IPS}$ often delivering the strongest gains. The work offers a practical, scalable path for efficient evaluation and policy improvement in recommender systems, and outlines avenues for extending Δ-OPE to doubly robust and ranking contexts.
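
The role of positive covariance can be made explicit with the standard variance identity for a difference of two estimators. The notation $\hat{V}$ for an estimate of a policy value is an assumption made here for illustration and is not quoted from the paper:

```latex
$$
\mathrm{Var}\big(\hat{V}(\pi_t) - \hat{V}(\pi_p)\big)
  = \mathrm{Var}\big(\hat{V}(\pi_t)\big)
  + \mathrm{Var}\big(\hat{V}(\pi_p)\big)
  - 2\,\mathrm{Cov}\big(\hat{V}(\pi_t), \hat{V}(\pi_p)\big)
$$
```

When the covariance term is positive, the variance of the estimated difference is smaller than the sum of the two individual variances, which is the effect the pairwise formulation exploits.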

Abstract

The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to unbiasedly estimate online metrics using offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that comes with unbiasedness is typically the crux that complicates practical applications. An important insight is that the difference between policy values can often be estimated with significantly reduced variance, if said policies have positive covariance. This allows us to formulate a pairwise off-policy estimation task: $Δ\text{-}{\rm OPE}$. $Δ\text{-}{\rm OPE}$ subsumes the common use-case of estimating improvements of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce $Δ\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring estimator and its extensions. Moreover, we characterise a variance-optimal additive control variate that further enhances efficiency. Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
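
The three estimators named above can be illustrated with a minimal sketch on a toy, context-free bandit. Everything in this block is an assumption made for illustration: the function names, the toy policies and reward means, and the plug-in estimate of $\beta$. The expression used for $\beta$ follows from standard control-variate algebra (the weight $w = (\pi_t - \pi_p)/\pi_0$ has zero mean under $\pi_0$, so the variance-minimising offset is $\mathbb{E}[w^2 r] / \mathbb{E}[w^2]$); it is not quoted from the paper, and estimating $\beta$ from the same sample can introduce a small bias.

```python
import numpy as np


def delta_ips(pi_t, pi_p, pi_0, r):
    """Pairwise IPS: mean of ((pi_t - pi_p) / pi_0) * r over logged samples."""
    w = (pi_t - pi_p) / pi_0
    return float(np.mean(w * r))


def delta_snips(pi_t, pi_p, pi_0, r):
    """Difference of two self-normalised IPS estimates (assumed form)."""
    w_t = pi_t / pi_0
    w_p = pi_p / pi_0
    return float(np.sum(w_t * r) / np.sum(w_t) - np.sum(w_p * r) / np.sum(w_p))


def delta_beta_ips(pi_t, pi_p, pi_0, r):
    """Pairwise IPS with an additive control variate beta.

    beta is a plug-in estimate of E[w^2 r] / E[w^2] (illustrative assumption).
    """
    w = (pi_t - pi_p) / pi_0
    denom = np.sum(w ** 2)
    # If both policies agree on every logged action, w is all zeros
    # and any beta gives the same (zero) estimate.
    beta = float(np.sum(w ** 2 * r) / denom) if denom > 0 else 0.0
    return float(np.mean(w * (r - beta))), beta


# Toy setup (hypothetical numbers): 5 actions, Bernoulli rewards.
rng = np.random.default_rng(0)
mean_reward = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
logging = np.full(5, 0.2)                            # pi_0: uniform logging policy
production = np.array([0.1, 0.1, 0.2, 0.3, 0.3])     # pi_p
target = np.array([0.05, 0.05, 0.2, 0.3, 0.4])       # pi_t

# Ground-truth improvement V(pi_t) - V(pi_p) in this toy problem.
true_delta = float((target - production) @ mean_reward)

# Log n interactions under pi_0.
n = 100_000
actions = rng.choice(5, size=n, p=logging)
rewards = rng.binomial(1, mean_reward[actions]).astype(float)

# Per-sample probabilities of the logged action under each policy.
p0, pp, pt = logging[actions], production[actions], target[actions]

est_ips = delta_ips(pt, pp, p0, rewards)
est_snips = delta_snips(pt, pp, p0, rewards)
est_beta_ips, beta_hat = delta_beta_ips(pt, pp, p0, rewards)

print(f"true delta     : {true_delta:.4f}")
print(f"delta-IPS      : {est_ips:.4f}")
print(f"delta-SNIPS    : {est_snips:.4f}")
print(f"delta-beta-IPS : {est_beta_ips:.4f} (beta = {beta_hat:.3f})")
```

In this sketch all three estimates use the same logged sample, so the two policies' importance weights are evaluated on identical actions and rewards. That shared sample is the source of the covariance that the pairwise estimators rely on.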

Paper Structure

This paper contains 11 sections, 19 equations, 1 figure, 1 table.

Introduction & Motivation
Methodology & Contributions
Pairwise Off-Policy Estimation
Pairwise Inverse Propensity Scoring: $\Delta{\rm\text{-}IPS}$
Multiplicative Control Variates: $\Delta{\rm\text{-}SNIPS}$
Additive Control Variates: $\Delta\beta{\rm\text{-}IPS}$
Experiments & Discussion
Evaluation with discrete actions (RQ1)
Evaluation with continuous actions (RQ2)
Learning improved policies (RQ3)
Conclusions & Outlook

Figures (1)

Figure 1: The $\Delta\text{-}{\rm OPE}$ estimator family significantly improves performance, with $\Delta\beta\text{-}{\rm IPS}$ consistently performing best.
