Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation

Olivier Jeunen; Shashank Gupta

TL;DR

The paper analyses variance reduction in off-policy evaluation for ranking and recommendation. It proves that $\beta^{\star}$-IPS, the IPS estimator with an optimal additive baseline, asymptotically dominates SNIPS in MSE, and it derives a precise variance gap showing that SNIPS is suboptimal except in special cases. The results extend to ranking via the Item-Position Model, where position-wise $\beta_{\perp \perp}^{\star}$-IPM dominates SNIPM at every rank. Practically, this justifies replacing self-normalisation with optimal baseline corrections, with cross-fitting available to address finite-sample bias where needed.

Abstract

Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that $\beta^{\star}$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.
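To make the comparison in the abstract concrete, the following is a minimal sketch of the three estimators on a toy problem. It is not the authors' code. The two-action, context-free bandit, the policy and reward values, and all function names (`ips`, `snips`, `beta_ips`, `plug_in_beta`, `simulate`) are assumptions chosen for illustration. The baseline is estimated with a simple plug-in of $\mathrm{Cov}(wr, w)/\mathrm{Var}(w)$ on the same sample, which introduces the finite-sample bias that the TL;DR says cross-fitting can address.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def ips(w, r):
    """Vanilla IPS: average of w_i * r_i."""
    return mean([wi * ri for wi, ri in zip(w, r)])


def snips(w, r):
    """Self-normalised IPS: sum(w_i * r_i) / sum(w_i)."""
    return sum(wi * ri for wi, ri in zip(w, r)) / sum(w)


def beta_ips(w, r, beta):
    """IPS with a global additive baseline: beta + mean(w_i * (r_i - beta))."""
    return beta + mean([wi * (ri - beta) for wi, ri in zip(w, r)])


def plug_in_beta(w, r):
    """Sample estimate of Cov(w*r, w) / Var(w) (plug-in, same-sample)."""
    wr = [wi * ri for wi, ri in zip(w, r)]
    m_w, m_wr = mean(w), mean(wr)
    cov = mean([(a - m_wr) * (b - m_w) for a, b in zip(wr, w)])
    var = mean([(b - m_w) ** 2 for b in w])
    return cov / var if var > 0 else 0.0  # guard: all weights identical


def simulate(n=200, reps=500, seed=0):
    """Empirical MSE of each estimator on a hypothetical two-action bandit."""
    rng = random.Random(seed)
    pi0 = [0.8, 0.2]   # logging policy (assumed)
    pi = [0.3, 0.7]    # target policy (assumed)
    mu = [0.2, 0.6]    # Bernoulli reward means per action (assumed)
    truth = sum(p * m for p, m in zip(pi, mu))
    sq_err = {"ips": 0.0, "snips": 0.0, "beta_ips": 0.0}
    for _ in range(reps):
        a = [0 if rng.random() < pi0[0] else 1 for _ in range(n)]
        w = [pi[ai] / pi0[ai] for ai in a]
        r = [1.0 if rng.random() < mu[ai] else 0.0 for ai in a]
        estimates = {
            "ips": ips(w, r),
            "snips": snips(w, r),
            "beta_ips": beta_ips(w, r, plug_in_beta(w, r)),
        }
        for name, est in estimates.items():
            sq_err[name] += (est - truth) ** 2
    return {name: total / reps for name, total in sq_err.items()}


if __name__ == "__main__":
    for name, mse in simulate().items():
        print(f"{name:>8}: empirical MSE = {mse:.6f}")
```

Running the script prints one empirical MSE per estimator. The paper's guarantee is asymptotic, so a single small simulation like this one illustrates the comparison but does not verify it.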

Paper Structure (9 sections, 3 theorems, 13 equations)

Introduction & Motivation
Off-Policy Evaluation, Background & Notation
Theoretical Contributions
$\beta^{\star}$-IPS dominates SNIPS in MSE
$\beta^{\star}$-IPS reduces SNIPS' asymptotic variance
$\beta_{\perp \mkern-9.5mu \perp}^\star$-IPM dominates SNIPM at every position
A practical note on estimation bias for $\beta^{\star}$
Discussion
Conclusions

Key Result

Theorem 1

Let $\mathcal{D}=\{(x_i,a_i,r_i)\}_{i=1}^n$ be i.i.d. logged data generated under a logging policy $\pi_0$. Let $w_i=\pi(a_i \mid x_i)/\pi_0(a_i \mid x_i)$ be the importance weight for a target policy $\pi$. Assume: Let $\beta^{\star}$ denote the baseline that minimises the mean squared error (MSE) within the family of estimators with a global additive control variate (cf. Eq. (38) in Gupta2024).
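As a hedged illustration of what "optimal global additive baseline" means here, the standard control-variate argument can be worked through as follows. It assumes a fixed (non-data-dependent) $\beta$ and common support, so that $\mathbb{E}_{\pi_0}[w]=1$. The notation $\hat{V}_{\beta}$, $\hat{V}_{\text{IPS}}$ and $\bar{w}$ is introduced for this sketch, and the cited Eq. (38) may write the result in a different but equivalent form.

$$
\hat{V}_{\beta} \;=\; \beta + \frac{1}{n}\sum_{i=1}^{n} w_i\,(r_i-\beta) \;=\; \hat{V}_{\text{IPS}} - \beta\,(\bar{w}-1), \qquad \bar{w}=\frac{1}{n}\sum_{i=1}^{n} w_i .
$$

Because $\mathbb{E}_{\pi_0}[\bar{w}]=1$, the estimator is unbiased for any fixed $\beta$, so its MSE equals its variance:

$$
\operatorname{Var}\big(\hat{V}_{\beta}\big) \;=\; \frac{1}{n}\Big[\operatorname{Var}(wr) - 2\beta\,\operatorname{Cov}(wr,w) + \beta^{2}\operatorname{Var}(w)\Big].
$$

This is a quadratic in $\beta$, minimised at

$$
\beta \;=\; \frac{\operatorname{Cov}(wr,\,w)}{\operatorname{Var}(w)} \;=\; \frac{\mathbb{E}_{\pi_0}[w^{2}r]-\mathbb{E}_{\pi_0}[wr]}{\mathbb{E}_{\pi_0}[w^{2}]-1}.
$$

In practice the expectations are unknown and must be estimated from the logged data, which is where the finite-sample bias discussed in the paper arises.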

Theorems & Definitions (3)

Theorem 1: Asymptotic MSE comparison of $\beta^{\star}$-IPS and SNIPS
Proposition 1: Analytical Variance Gap
Theorem 2: Asymptotic MSE comparison of $\beta_{\perp \mkern-9.5mu \perp}^{\star}$-IPM and SNIPM
