Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito

TL;DR

The paper develops a new metric, SharpeRatio@k, which measures the risk-return tradeoff of the policy portfolios that an OPE estimator forms under varying online evaluation budgets (k). The metric is integrated into the open-source software SCOPE-RL to facilitate quick, accurate, and consistent evaluation of OPE.

Abstract

Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data and is often used to identify the top-k promising policies for deployment in online A/B tests. Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or that of downstream policy selection, neglecting the risk-return tradeoff in the subsequent online policy deployment. To address this issue, we draw inspiration from portfolio evaluation in finance and develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff of policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We validate our metric in two example scenarios, demonstrating its ability to effectively distinguish between low-risk and high-risk estimators and to accurately identify the most efficient one. The efficiency of an estimator is characterized by its capability to form the most advantageous policy portfolios, maximizing returns while minimizing risks during online deployment, a nuance that existing metrics typically overlook. To facilitate quick, accurate, and consistent evaluation of OPE via SharpeRatio@k, we have also integrated this metric into the open-source software SCOPE-RL (https://github.com/hakuhodo-technologies/scope-rl). Employing SharpeRatio@k and SCOPE-RL, we conduct comprehensive benchmarking experiments on various estimators and RL tasks, focusing on their risk-return tradeoff. These experiments offer several interesting directions and suggestions for future OPE research.
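
To make the idea of a risk-return tradeoff over a policy portfolio concrete, here is a minimal Python sketch. It is not the SCOPE-RL implementation. It assumes, by analogy with the Sharpe ratio in finance, that the metric takes the form (best@k - J(pi_b)) / std@k. Here best@k is the highest true value among the k policies the estimator ranks on top, std@k is the standard deviation of their true values, and J(pi_b) is the value of the behavior policy, used as the risk-free baseline. The function name, argument names, and the choice of population standard deviation are illustrative assumptions.

```python
import numpy as np


def sharpe_ratio_at_k(estimated_values, true_values, behavior_value, k):
    """Sketch of a Sharpe-ratio-style metric for a top-k policy portfolio.

    Assumed form (not confirmed by the text above):
        (best@k - J(pi_b)) / std@k
    """
    estimated_values = np.asarray(estimated_values, dtype=float)
    true_values = np.asarray(true_values, dtype=float)

    # Portfolio: the k policies that the OPE estimator ranks highest.
    top_k = np.argsort(-estimated_values)[:k]
    portfolio = true_values[top_k]

    best_at_k = portfolio.max()  # return: best policy found online
    std_at_k = portfolio.std()   # risk: spread of deployed policies (ddof=0 is an assumption)

    if std_at_k == 0.0:
        # Undefined when the portfolio has no spread (for example k = 1).
        return float("nan")
    return (best_at_k - behavior_value) / std_at_k


# Hypothetical values for four candidate policies.
true_values = [1.0, 2.0, 3.0, 4.0]
safe = sharpe_ratio_at_k([1.0, 2.0, 3.0, 4.0], true_values, behavior_value=2.0, k=2)
risky = sharpe_ratio_at_k([4.0, 1.0, 2.0, 3.0], true_values, behavior_value=2.0, k=2)
print(safe, risky)
```

In this hypothetical example both estimators include the best policy in their top-2 portfolio, but the second also deploys the worst policy, so its ratio is lower.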

Paper Structure (40 sections, 13 equations, 12 figures, 12 tables)

Figures (12)

  • Figure 1: (Left) Conventional (off-)policy selection directly chooses the production policy via OPE. (Right) Practical workflow of policy evaluation and selection involves OPE as a screening process where an OPE estimator ($\hat{J}$) chooses top-$k$ candidate policies that are to be tested in online A/B tests, where $k$ is a pre-defined online evaluation budget. A policy that is identified as the best policy based on the evaluation process will be chosen as the production policy ($\hat{\pi}^*$).
  • Figure 2: A toy example illustrating the situation where existing metrics (MSE, Rankcorr, and Regret) fail to evaluate the risk in the off-policy selection (OPS) task. Both estimators X and Y conduct OPS on the same set of candidate policies. X underestimates the values of the policies indicated by the black dots while Y overestimates them. The shaded regions show the top-3 policies (policy portfolio) selected by each estimator, indicating that Y is riskier than X since Y includes worse policies in its policy portfolio. Nonetheless, existing metrics give completely identical evaluations for X and Y.
  • Figure 3: Evaluating estimators X and Y in the toy example of Figure 2 with SharpeRatio@k.
  • Figure 4: A toy example illustrating the evaluation of a conservative OPE estimator (estimator W) and uniform random selection (estimator Z) with conventional evaluation-of-OPE metrics (top-right table) and SharpeRatio@k (bottom figures).
  • Figure 5:
  • ...and 7 more figures
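
The practical workflow described in the caption of Figure 1 (OPE as a screening step, followed by online A/B tests on the top-k candidates) can be sketched in Python as follows. This is an illustrative outline only: the function `screen_then_ab_test` and the callables `estimate_value` and `run_ab_test` are hypothetical placeholders, not part of SCOPE-RL.

```python
def screen_then_ab_test(candidates, estimate_value, run_ab_test, k):
    """Pick a production policy by OPE screening followed by online testing.

    candidates     : list of hashable policy identifiers
    estimate_value : callable giving the OPE estimate for a policy (offline)
    run_ab_test    : callable giving the online value of a policy (hypothetical)
    k              : online evaluation budget (number of policies to test)
    """
    # Offline step: rank candidates by their OPE estimate and keep the top k.
    shortlist = sorted(candidates, key=estimate_value, reverse=True)[:k]

    # Online step: only the shortlisted policies are deployed in A/B tests.
    online_values = {policy: run_ab_test(policy) for policy in shortlist}

    # The policy with the best online value becomes the production policy.
    production_policy = max(online_values, key=online_values.get)
    return production_policy, shortlist


# Hypothetical estimates and online values for four candidate policies.
ope_estimates = {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.5}
online_truth = {"a": 0.2, "b": 0.9, "c": 0.7, "d": 0.6}

chosen, tested = screen_then_ab_test(
    list(ope_estimates), ope_estimates.get, online_truth.get, k=2
)
print(chosen, tested)
```

Every policy in the shortlist is actually deployed during the online step, which is why the quality of the whole top-k portfolio matters, not only that of the final choice.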