Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation
Olivier Jeunen, Shashank Gupta
TL;DR
The paper analyzes variance reduction in off-policy evaluation for ranking and recommendations. It proves that the optimal additive baseline in $\beta^{\star}$-IPS achieves asymptotic MSE dominance over SNIPS, and derives a precise variance gap showing SNIPS is suboptimal except in special cases. The results extend to ranking via the Item-Position Model, where position-wise $\beta_{\perp \perp}^{\star}$-IPS dominates SNIPM at every rank. Practically, this justifies replacing self-normalisation with optimal baseline corrections, while addressing finite-sample bias via cross-fitting as needed.
Abstract
Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that $\beta^{\star}$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific, but generally sub-optimal, additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.
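To make the comparison concrete, the following is a minimal sketch of the three estimators on a toy logged-bandit problem. It is an illustration under stated assumptions, not the authors' implementation. The baseline is estimated with the textbook control-variate coefficient $\hat{\beta} = \widehat{\mathrm{Cov}}(wr, w) / \widehat{\mathrm{Var}}(w)$, where $w$ is the importance weight and $r$ the reward; the paper's exact estimator of $\beta^{\star}$ may differ. All function names, the four-action environment, and its probabilities are hypothetical.

```python
import numpy as np


def ips(w, r):
    """Vanilla IPS: mean of importance-weighted rewards."""
    return float(np.mean(w * r))


def snips(w, r):
    """Self-normalised IPS: divide by the sum of weights (multiplicative control variate)."""
    return float(np.sum(w * r) / np.sum(w))


def beta_ips(w, r, beta):
    """Baseline-corrected IPS with a fixed additive baseline beta."""
    return float(beta + np.mean(w * (r - beta)))


def estimate_beta(w, r):
    """Plug-in control-variate coefficient Cov(w*r, w) / Var(w).

    Assumption: this is the standard variance-minimising coefficient,
    used here as a stand-in for the paper's optimal baseline.
    """
    wr = w * r
    var_w = np.var(w)
    if var_w == 0.0:  # constant weights: any baseline gives the same estimate
        return 0.0
    cov = np.mean((wr - wr.mean()) * (w - w.mean()))
    return float(cov / var_w)


def beta_star_ips(w, r):
    """Baseline-corrected IPS using the plug-in baseline estimated on the same data."""
    return beta_ips(w, r, estimate_beta(w, r))


def simulate_mse(n=2000, n_runs=300, seed=0):
    """Monte Carlo MSE of each estimator on a hypothetical 4-action bandit."""
    rng = np.random.default_rng(seed)
    p_log = np.array([0.4, 0.3, 0.2, 0.1])   # logging policy (assumed)
    p_tgt = np.array([0.1, 0.2, 0.3, 0.4])   # target policy (assumed)
    mean_r = np.array([0.2, 0.4, 0.6, 0.8])  # Bernoulli reward means (assumed)
    true_value = float(p_tgt @ mean_r)

    estimators = {"ips": ips, "snips": snips, "beta_star_ips": beta_star_ips}
    sq_err = {name: 0.0 for name in estimators}
    for _ in range(n_runs):
        a = rng.choice(len(p_log), size=n, p=p_log)       # logged actions
        r = (rng.random(n) < mean_r[a]).astype(float)     # observed rewards
        w = p_tgt[a] / p_log[a]                           # importance weights
        for name, fn in estimators.items():
            sq_err[name] += (fn(w, r) - true_value) ** 2
    return {name: total / n_runs for name, total in sq_err.items()}


if __name__ == "__main__":
    for name, mse in simulate_mse().items():
        print(f"{name:>14s}  MSE = {mse:.2e}")
```

Because the baseline here is estimated on the same samples it corrects, the plug-in estimator carries a small finite-sample bias. Estimating $\hat{\beta}$ on one data split and applying it to another (cross-fitting) is one way to remove that dependence.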
