Cross-Validated Off-Policy Evaluation

Matej Cief; Branislav Kveton; Michal Kompan

TL;DR

Problem: estimating $V(\pi)$ in off-policy settings where ground-truth policy value is unknown, and selecting estimators or tuning hyper-parameters is challenging. Approach: adapt cross-validation to OPE by using an unbiased validator on held-out data and a training/validation split strategy, culminating in the Off-Policy Cross-Validation (OCV) algorithm with a one-standard-error rule. Contributions: a general, simple, and efficient framework for estimator selection and hyper-parameter tuning in OPE, validated on nine real datasets and outperforming state-of-the-art baselines. Significance: enables data-driven, robust OPE in contexts where online experimentation is costly or risky, with potential extensions to offline policy learning and reinforcement learning.

Abstract

We study estimator selection and hyper-parameter tuning in off-policy evaluation. Although cross-validation is the most popular method for model selection in supervised learning, off-policy evaluation relies mostly on theory, which provides only limited guidance to practitioners. We show how to use cross-validation for off-policy evaluation. This challenges a popular belief that cross-validation in off-policy evaluation is not feasible. We evaluate our method empirically and show that it addresses a variety of use cases.
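The idea of validating candidate estimators against an unbiased estimate on held-out logged data can be illustrated with a minimal sketch. It is not the paper's OCV algorithm: the function names `ips` and `ocv_select`, the fixed 50/50 split, the clipped-IPS candidates, and the synthetic bandit data are all assumptions made for illustration, and the one-standard-error rule is omitted.

```python
import numpy as np

def ips(weights, rewards, clip=np.inf):
    """Clipped inverse-propensity-score estimate of the target policy value."""
    return float(np.mean(np.minimum(weights, clip) * rewards))

def ocv_select(weights, rewards, clips, n_splits=10, train_frac=0.5, seed=0):
    """Pick the clipping level whose training-split estimate is closest,
    in squared error averaged over random splits, to an unclipped
    (unbiased) IPS estimate computed on the held-out split."""
    rng = np.random.default_rng(seed)
    n = len(rewards)
    n_train = int(train_frac * n)
    losses = np.zeros((n_splits, len(clips)))
    for k in range(n_splits):
        perm = rng.permutation(n)
        train, val = perm[:n_train], perm[n_train:]
        # Validator: unclipped IPS on held-out data (hypothetical choice).
        target = ips(weights[val], rewards[val])
        for j, c in enumerate(clips):
            estimate = ips(weights[train], rewards[train], clip=c)
            losses[k, j] = (estimate - target) ** 2
    mean_loss = losses.mean(axis=0)
    return clips[int(np.argmin(mean_loss))], mean_loss

# Synthetic logged data: uniform logging policy, skewed target policy.
rng = np.random.default_rng(1)
n, n_actions = 2000, 5
logging_probs = np.full(n_actions, 1.0 / n_actions)
target_probs = np.array([0.6, 0.1, 0.1, 0.1, 0.1])
mean_reward = np.linspace(0.1, 0.9, n_actions)
actions = rng.choice(n_actions, size=n, p=logging_probs)
rewards = rng.binomial(1, mean_reward[actions]).astype(float)
weights = target_probs[actions] / logging_probs[actions]

clips = [1.0, 2.0, np.inf]
best_clip, mean_loss = ocv_select(weights, rewards, clips)
print("selected clip:", best_clip, "mean validation losses:", mean_loss)
```

The same loop applies to estimator selection if the candidate list holds different estimators (for example IPS, DM, and DR) rather than different clipping levels.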

Paper Structure (30 sections, 2 theorems, 38 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Off-Policy Evaluation
Related Work
Cross-Validation in Machine Learning
Off-Policy Cross-Validation
Analysis
One Standard Error Rule
Algorithm
Experiments
Estimator Selection
Hyper-Parameter Tuning
Conclusion
Implementation Details
Datasets
Estimators Tuned in the Experiments
...and 15 more sections

Key Result

Theorem 1

For any split $k \in [K]$,

Figures (7)

Figure 1: MSE of our estimator selection methods, $\tt OCV_\textsc{IPS}$ and $\tt OCV_\textsc{DR}$, compared against two other estimator selection baselines, $\tt Slope$ and $\tt PAS-IF$. The methods select the best estimator out of IPS, DM, and DR. In all figures, we report $95\%$ confidence intervals estimated by bootstrapping.
Figure 2: MSE of the methods for temperatures $\beta_0 = 1$ and $\beta_1 = -10$. $\tt OCV$ performs well even when its validator does not, for example $\tt OCV_\textsc{DR}$ on the glass dataset. This also shows that $\tt OCV$ does not simply choose the same estimator as the validator.
Figure 3: MSE of our estimator selection methods and specialized theoretical approaches applied to hyper-parameter tuning of various estimators. The label "Everything" refers to joint estimator selection and hyper-parameter tuning. This shows that $\tt OCV$ is a reliable and practical method for choosing a suitable and well-tuned estimator.
Figure 4: Ablation of the improvements proposed in the Analysis and One Standard Error Rule sections, with $\tt OCV_\textsc{DR}$. This shows that both improvements individually help reduce the variance of estimation errors. However, when combined, the theory split ratio makes the one standard error rule insignificant.
Figure 5: Ablation of the number of repeated training/validation splits with $\tt OCV_\textsc{DR}$ on the vehicle dataset, averaged over 500 runs. The improvements diminish as the number of splits increases.
...and 2 more figures

Theorems & Definitions (4)

Theorem 1
proof
Theorem 2
proof
