Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation

Olivier Jeunen, Shashank Gupta

TL;DR

The paper analyzes variance reduction in off-policy evaluation for ranking and recommendations. It proves that the optimal additive baseline in $\beta^{\star}$-IPS achieves asymptotic MSE dominance over SNIPS, and derives a precise variance gap showing SNIPS is suboptimal except in special cases. The results extend to ranking via the Item-Position Model, where position-wise $\beta_{\perp \perp}^{\star}$-IPS dominates SNIPM at every rank. Practically, this justifies replacing self-normalisation with optimal baseline corrections, while addressing finite-sample bias via cross-fitting as needed.

Abstract

Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that $\beta^{\star}$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.

Paper Structure

This paper contains 9 sections, 3 theorems, and 13 equations.

Key Result

Theorem 1 (Asymptotic MSE comparison of $\beta^{\star}$-IPS and SNIPS)

Let $\mathcal{D}=\{(x_i,a_i,r_i)\}_{i=1}^n$ be i.i.d. logged data generated under a logging policy $\pi_0$. Let $w_i=\pi(a_i \mid x_i)/\pi_0(a_i \mid x_i)$ be the importance weight for a target policy $\pi$. Assume: Let $\beta^{\star}$ denote the baseline that minimises the mean squared error (MSE) within the family of estimators with a global additive control variate (cf. Eq. (38) in Gupta2024).
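The estimators compared in the theorem can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' code: the function names (`ips`, `snips`, `beta_ips`, `beta_star_hat`) are hypothetical, the toy context-free bandit with five actions is invented, and the baseline is taken to be the standard variance-minimising control-variate coefficient $\mathrm{Cov}(wr, w)/\mathrm{Var}(w)$, estimated by plug-in on the same sample.

```python
import numpy as np

def ips(w, r):
    """Vanilla IPS: sample mean of importance-weighted rewards."""
    return float(np.mean(w * r))

def snips(w, r):
    """Self-normalised IPS: normalise by the sum of weights instead of n."""
    return float(np.sum(w * r) / np.sum(w))

def beta_ips(w, r, beta):
    """IPS with a global additive baseline beta (control variate w - 1)."""
    return float(np.mean(w * (r - beta)) + beta)

def beta_star_hat(w, r):
    """Plug-in estimate of Cov(w*r, w) / Var(w) (assumed form of the optimal baseline)."""
    cov = np.cov(w * r, w)  # 2x2 sample covariance matrix
    return float(cov[0, 1] / cov[1, 1])

# Hypothetical toy bandit: uniform logging policy, skewed target policy.
rng = np.random.default_rng(0)
n_actions, n = 5, 10_000
pi0 = np.full(n_actions, 1.0 / n_actions)          # logging policy
pi = np.array([0.6, 0.1, 0.1, 0.1, 0.1])           # target policy
mean_reward = np.array([0.9, 0.2, 0.3, 0.4, 0.5])  # Bernoulli reward means

a = rng.choice(n_actions, size=n, p=pi0)           # logged actions
r = rng.binomial(1, mean_reward[a]).astype(float)  # logged rewards
w = pi[a] / pi0[a]                                 # importance weights

true_value = float(pi @ mean_reward)
estimates = {
    "IPS": ips(w, r),
    "SNIPS": snips(w, r),
    "beta-IPS": beta_ips(w, r, beta_star_hat(w, r)),
}
for name, value in estimates.items():
    print(f"{name}: {value:.4f} (true value {true_value:.4f})")
```

Two algebraic identities are worth noting: `beta_ips(w, r, 0.0)` reduces to `ips(w, r)`, and `beta_ips(w, r, snips(w, r))` returns the SNIPS estimate itself, so SNIPS can be read as an additive-baseline estimator whose baseline is its own output. Because `beta_star_hat` reuses the evaluation sample, the plug-in estimator carries a finite-sample bias; the TL;DR mentions cross-fitting as the remedy when this matters.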

Theorems & Definitions (3)

  • Theorem 1: Asymptotic MSE comparison of $\beta^{\star}$-IPS and SNIPS
  • Proposition 1: Analytical Variance Gap
  • Theorem 2: Asymptotic MSE comparison of $\beta_{\perp \mkern-9.5mu \perp}^{\star}$-IPM and SNIPM
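
How a variance gap of the kind named in Proposition 1 can arise is sketched below using standard control-variate algebra. This is an illustration under assumptions not spelled out in this summary (full support so that $\mathbb{E}[w]=1$, finite second moments, and a delta-method linearisation of SNIPS). The symbols $V$ (the true value of $\pi$) and $\hat{V}_{\beta}$ are introduced here for convenience, and the paper's exact statement may differ.

$$
\begin{aligned}
\hat{V}_{\beta} &= \frac{1}{n}\sum_{i=1}^{n}\bigl[w_i (r_i-\beta)+\beta\bigr], \qquad \mathbb{E}\bigl[\hat{V}_{\beta}\bigr]=V, \\
n\,\mathrm{Var}\bigl(\hat{V}_{\beta}\bigr) &= \mathrm{Var}(wr)-2\beta\,\mathrm{Cov}(wr,w)+\beta^{2}\,\mathrm{Var}(w), \\
\beta^{\star} &= \frac{\mathrm{Cov}(wr,w)}{\mathrm{Var}(w)}, \\
n\Bigl[\mathrm{Var}\bigl(\hat{V}_{\beta}\bigr)-\mathrm{Var}\bigl(\hat{V}_{\beta^{\star}}\bigr)\Bigr] &= \mathrm{Var}(w)\,\bigl(\beta-\beta^{\star}\bigr)^{2}.
\end{aligned}
$$

Linearising SNIPS around $\frac{1}{n}\sum_i w_i \to 1$ gives $\hat{V}_{\mathrm{SNIPS}}-V \approx \frac{1}{n}\sum_i w_i (r_i - V)$, which matches the additive form with $\beta = V$. Under this sketch the asymptotic gap is $\mathrm{Var}(w)\,(V-\beta^{\star})^{2}/n$, which vanishes only when $\beta^{\star}=V$ or the weights are constant.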