Cross-Validated Off-Policy Evaluation
Matej Cief, Branislav Kveton, Michal Kompan
TL;DR
Problem: in off-policy evaluation (OPE), the ground-truth policy value $V(\pi)$ is unknown, which makes it hard to select an estimator or tune its hyper-parameters. Approach: adapt cross-validation to OPE by splitting the data into training and validation sets, using an unbiased validator on the held-out data, and applying a one-standard-error rule; the resulting algorithm is Off-Policy Cross-Validation (OCV). Contributions: a general, simple, and efficient framework for estimator selection and hyper-parameter tuning in OPE, which is validated on nine real datasets and outperforms state-of-the-art baselines. Significance: enables data-driven, robust OPE in settings where online experimentation is costly or risky, with potential extensions to offline policy learning and reinforcement learning.
Abstract
We study estimator selection and hyper-parameter tuning in off-policy evaluation. Although cross-validation is the most popular method for model selection in supervised learning, off-policy evaluation relies mostly on theory, which provides only limited guidance to practitioners. We show how to use cross-validation for off-policy evaluation. This challenges a popular belief that cross-validation in off-policy evaluation is not feasible. We evaluate our method empirically and show that it addresses a variety of use cases.
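To make the idea concrete, the following is a minimal sketch of cross-validated estimator selection, not the authors' implementation. It rests on several assumptions that are not stated in the text above: the candidates are clipped inverse propensity scoring (IPS) estimators with different clipping thresholds, the unbiased validator is unclipped IPS on the held-out split, the loss is the squared difference between the two, and the one-standard-error rule is read as "pick the candidate with the smallest mean loss plus one standard error". The names `clipped_ips` and `select_clip` and the synthetic data are hypothetical.

```python
import numpy as np


def clipped_ips(rewards, weights, clip):
    """IPS estimate of V(pi) with importance weights clipped at `clip`."""
    return float(np.mean(np.minimum(weights, clip) * rewards))


def select_clip(rewards, weights, clips, n_splits=20, train_frac=0.5, seed=0):
    """Choose a clipping threshold by cross-validation (illustrative sketch).

    Each candidate is fitted on a random training split and compared with an
    unclipped (unbiased) IPS estimate computed on the held-out split.
    """
    rng = np.random.default_rng(seed)
    n = len(rewards)
    n_train = int(train_frac * n)
    losses = np.empty((n_splits, len(clips)))

    for s in range(n_splits):
        perm = rng.permutation(n)
        train, valid = perm[:n_train], perm[n_train:]
        # Assumed validator: unclipped IPS on the held-out data.
        target = np.mean(weights[valid] * rewards[valid])
        for j, clip in enumerate(clips):
            estimate = clipped_ips(rewards[train], weights[train], clip)
            losses[s, j] = (estimate - target) ** 2

    mean = losses.mean(axis=0)
    se = losses.std(axis=0, ddof=1) / np.sqrt(n_splits)
    # Assumed reading of the one-standard-error rule: smallest upper bound.
    best = int(np.argmin(mean + se))
    return clips[best], mean, se


# Synthetic, context-free logged data (purely illustrative).
rng = np.random.default_rng(42)
n = 2000
p_logging = np.array([0.5, 0.2, 0.15, 0.1, 0.05])   # logging policy
p_target = np.array([0.05, 0.1, 0.15, 0.2, 0.5])    # target policy pi
mean_reward = np.array([0.1, 0.2, 0.3, 0.4, 0.5])

actions = rng.choice(5, size=n, p=p_logging)
rewards = rng.binomial(1, mean_reward[actions]).astype(float)
weights = p_target[actions] / p_logging[actions]     # importance weights

clips = [1.0, 2.0, 5.0, 10.0, np.inf]                # np.inf means no clipping
chosen, mean, se = select_clip(rewards, weights, clips)
print("chosen clipping threshold:", chosen)
```

The same loop applies to any family of candidate estimators or hyper-parameter values: only the inner call that produces `estimate` changes. The classic one-standard-error rule from supervised learning instead picks the simplest model whose loss is within one standard error of the best, so the selection line is the part most worth checking against the paper.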
