When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective

Hao Sun; Alex J. Chan; Nabeel Seedat; Alihan Hüyük; Mihaela van der Schaar

TL;DR

This work proposes DataCOPE, a data-centric framework for evaluating OPE that forecasts the overall performance of OPE algorithms without access to the environment, which is especially useful before real-world deployment where evaluating OPE is impossible.

Abstract

Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging. On the one hand, it brings opportunities for safe policy improvement under high-stakes scenarios like clinical guidelines. On the other hand, such opportunities raise a need for precise off-policy evaluation (OPE). While previous work on OPE focused on improving the algorithm in value estimation, in this work, we emphasize the importance of the offline dataset, hence putting forward a data-centric framework for evaluating OPE problems. We propose DataCOPE, a data-centric framework for evaluating OPE, that answers the questions of whether and to what extent we can evaluate a target policy given a dataset. DataCOPE (1) forecasts the overall performance of OPE algorithms without access to the environment, which is especially useful before real-world deployment where evaluating OPE is impossible; (2) identifies the sub-group in the dataset where OPE can be inaccurate; (3) permits evaluations of datasets or data-collection strategies for OPE problems. Our empirical analysis of DataCOPE in the logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies like clinical guidelines. Finally, we apply DataCOPE to the task of reward modeling in Large Language Model alignment to demonstrate its scalability in real-world applications.

Paper Structure (60 sections, 10 equations, 8 figures, 6 tables, 1 algorithm)

Introduction
Preliminaries
Plug-in Estimation of the Value Function.
Data Characterization.
When is OPE Useful?
Inherent Difficulty of OPE: A Data-Centric Perspective
Distributional Direct Method for Uncertainty Decomposition
DM with Distributional Reward Estimators
Uncertainty Decomposition
Residual Prediction through Uncertainty
Experiments
Synthetic Dataset Generation
Target Policy
General Experiment Settings
Instance-Wise Difficulty Indication
...and 45 more sections

Figures (8)

Figure 1: Road map of DataCOPE. To highlight the difference between DataCOPE and the classical OPE literature: the objective of DataCOPE is to evaluate whether OPE problems are well-defined, while classical OPE focuses on improving estimators. 1st row: illustrates the offline dataset collection process. In the context of healthcare, it corresponds to treatment records abiding by an existing guideline. 2nd row: (Sec. 2) The collected dataset $\mathcal{D}$ is then used for OPE. For an OPE algorithm, Equation (\ref{eqn:2}) calculates the test-time residual (error) between the algorithm's value estimate and the true value. 3rd row: (Sec. 3) DataCOPE can serve as a proxy for the evaluation residual. It can work at test time as an OPE performance indicator without access to the true value $V(\pi_e)$ or the environment. 4th row: (Sec. 4) DataCOPE can be applied to various use cases, as demonstrated with extensive empirical studies. Notation is explained in Sec. \ref{sec:pre}.
Figure 2: DataCOPE decomposes the uncertainty and provides an instance-wise prediction of the estimation error. In the 3-D plots, we use colors to highlight the averaged OPE residual values (also z-axis) and visualize their strong correlation with the two uncertainty components. In the 2-D plots, we visualize the correlation between each uncertainty component and the averaged OPE residual values. The results highlight the necessity of uncertainty decomposition as different components may dominate the prediction of OPE residuals. This high correlation holds for different algorithms. For detailed non-averaged results, please refer to Appendix \ref{appdx:non-avg_results}.
Figure 3: Evaluating MELD over time. DataCOPE can be applied to monitor the OPE performance without the real policy value.
Figure 4: DataCOPE works as an accurate proxy of the OPE residual. DataCOPE is able to predict OPE residuals of different algorithms with calibration. Dataset: Breast Cancer.
Figure 5: DataCOPE works as an accurate proxy of the OPE residual. DataCOPE is able to predict OPE residuals of different algorithms with calibration. Dataset: Diabetes.
...and 3 more figures
