Long-term Off-Policy Evaluation and Learning

Yuta Saito; Himan Abdollahpouri; Jesse Anderton; Ben Carterette; Mounia Lalmas

TL;DR

The paper tackles estimating long-term outcomes of algorithm changes without lengthy online experiments by introducing Long-term Off-Policy Evaluation (LOPE). LOPE decomposes the long-term reward into a surrogate part explainable by short-term rewards and an action-specific part, enabling variance reduction through surrogate importance weights and a regression model for the action effect. It provides theoretical guarantees of unbiasedness under weaker conditions than surrogacy and demonstrates substantial variance reduction over standard OPE methods, including in policy learning scenarios via LOPE-PG. Empirical results on synthetic data and a large-scale real-world music platform show LOPE outperforms Long-term Causal Inference and standard OPE in estimation accuracy, policy selection, and policy learning, especially under high long-term reward noise.
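To make the idea of a surrogate importance weight concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes a context-free toy problem with a discrete short-term reward $s$, and takes the weight to be the ratio of the marginal distributions of $s$ under the new policy $\pi_1$ and the logging policy $\pi_0$. The names `p_s_given_a`, `pi_0`, `pi_1`, and `g`, and all numbers, are hypothetical.

```python
import numpy as np


def surrogate_marginal(policy, p_s_given_a):
    """Marginal p(s | pi) = sum_a pi(a) * p(s | a) in a context-free toy setting."""
    return policy @ p_s_given_a


def surrogate_importance_weights(pi_1, pi_0, p_s_given_a):
    """Ratio p(s | pi_1) / p(s | pi_0) for every short-term outcome s."""
    return surrogate_marginal(pi_1, p_s_given_a) / surrogate_marginal(pi_0, p_s_given_a)


# Hypothetical toy problem: 3 actions (rows), 2 short-term outcomes (columns).
p_s_given_a = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])
pi_0 = np.array([0.6, 0.3, 0.1])  # logging policy (assumed)
pi_1 = np.array([0.1, 0.3, 0.6])  # new policy to evaluate (assumed)

w = surrogate_importance_weights(pi_1, pi_0, p_s_given_a)

# Hypothetical surrogate part g(s): the long-term reward explained by s alone.
g = np.array([1.0, 3.0])

# Reweighting g(s) under pi_0's marginal recovers its expectation under pi_1.
reweighted = np.sum(surrogate_marginal(pi_0, p_s_given_a) * w * g)
target = np.sum(surrogate_marginal(pi_1, p_s_given_a) * g)
```

In this toy setting the weight depends only on $s$, not on the action, so its range is governed by how much the two marginals over $s$ differ rather than by the size of the action space. That is one intuition for the variance reduction the summary mentions; the action-specific part of the reward would still need the separate regression model.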

Abstract

Short- and long-term outcomes of an algorithm often differ, with damaging downstream effects. A known example is a click-bait algorithm, which may increase short-term clicks but damage long-term user engagement. A possible solution to estimate the long-term outcome is to run an online experiment or A/B test for the potential algorithms, but it takes months or even longer to observe the long-term outcomes of interest, making the algorithm selection process unacceptably slow. This work thus studies the problem of feasibly yet accurately estimating the long-term outcome of an algorithm using only historical and short-term experiment data. Existing approaches to this problem either need a restrictive assumption about the short-term outcomes called surrogacy or cannot effectively use short-term outcomes, which is inefficient. Therefore, we propose a new framework called Long-term Off-Policy Evaluation (LOPE), which is based on reward function decomposition. LOPE works under a more relaxed assumption than surrogacy and effectively leverages short-term rewards to substantially reduce the variance. Synthetic experiments show that LOPE outperforms existing approaches particularly when surrogacy is severely violated and the long-term reward is noisy. In addition, real-world experiments on large-scale A/B test data collected on a music streaming platform show that LOPE can estimate the long-term outcome of actual algorithms more accurately than existing feasible methods.

Paper Structure (26 sections, 4 theorems, 26 equations, 8 figures, 3 tables)

This paper contains 26 sections, 4 theorems, 26 equations, 8 figures, 3 tables.

Introduction
Problem Formulation
Long-term Experiment
Long-term Causal Inference (LCI)
Typical Off-Policy Evaluation (OPE)
Long-term Off-Policy Evaluation
The LOPE Estimator
Estimating Surrogate Importance Weights
Extension to Policy Learning
Synthetic Experiments
Synthetic Data Setup
Results in Policy Evaluation and Selection
Results in Policy Learning
Real-World Experiment
Conclusion and Future Work
...and 11 more sections

Key Result

Theorem 1

LOPE is unbiased, i.e., $\mathbb{E}_{\mathcal{D}_H}[\hat{V}_{\mathrm{LOPE}} (\pi_1; \mathcal{D}_H)] = V(\pi_1)$, if either of the following holds true:

Figures (8)

Figure 1: The statistical problem of estimating the long-term outcomes using historical and short-term experiment data
Figure 2: The Surrogacy Assumption
Figure 3: Reward decomposition employed by LOPE.
Figure 4: The surrogate importance weight of LOPE to estimate the surrogate effect (the $g$ function in Eq. \ref{eq:decomposition}) is computed based on the marginal distributions of the short-term reward induced by two different policies, $\pi_1$ and $\pi_0$.
Figure 5:
...and 3 more figures

Theorems & Definitions (4)

Theorem 1
Theorem 2
Theorem 3
Corollary 1
