Combining Experimental and Historical Data for Policy Evaluation

Ting Li; Chengchun Shi; Qianglin Wen; Yang Sui; Yongli Qin; Chunbo Lai; Hongtu Zhu

TL;DR

The paper tackles policy evaluation when multiple data sources are available, specifically an experimental dataset with two arms and a historical dataset collected under the control arm only. It introduces two estimators that linearly combine base estimators built from the experimental and historical data: one chooses the weights to minimize the mean squared error (MSE), and a pessimistic variant adds robustness to reward shifts between the two datasets. The authors establish non-asymptotic MSE bounds and oracle, efficiency, and robustness properties across a spectrum of reward-shift regimes, and report superior empirical performance on simulated data and real ridesharing data, and in sequential decision-making settings. The work advances data integration for causal learning by accommodating distributional shift and offering practical guidance on which estimator to use in each regime, with implications for offline policy evaluation and sequential reinforcement learning (RL).

Abstract

This paper studies policy evaluation with multiple data sources, especially in scenarios that involve one experimental dataset with two arms, complemented by a historical dataset generated under a single control arm. We propose novel data integration methods that linearly integrate base policy value estimators constructed based on the experimental and historical data, with weights optimized to minimize the mean square error (MSE) of the resulting combined estimator. We further apply the pessimistic principle to obtain more robust estimators, and extend these developments to sequential decision making. Theoretically, we establish non-asymptotic error bounds for the MSEs of our proposed estimators, and derive their oracle, efficiency and robustness properties across a broad spectrum of reward shift scenarios. Numerical experiments and real-data-based analyses from a ridesharing company demonstrate the superior performance of the proposed estimators.
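
To make the weighting idea concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the combined estimator has the form w * tau_e + (1 - w) * tau_h, that the experimental-only estimator tau_e is unbiased, and that the historical-data-based estimator tau_h carries a bias from reward shift. Under those assumptions the MSE is quadratic in w and has a closed-form minimizer. The function names, the clipping of w to [0, 1], and the pessimistic bound |bias_hat| + z * bias_se are illustrative assumptions, not taken from the paper.

```python
def optimal_weight(var_e, var_h, cov_eh, bias_h):
    """Weight on tau_e that minimizes the MSE of w*tau_e + (1-w)*tau_h.

    Assumes tau_e is unbiased and tau_h has bias bias_h, so that
    MSE(w) = w^2*var_e + (1-w)^2*(var_h + bias_h^2) + 2*w*(1-w)*cov_eh.
    """
    denom = var_e + var_h + bias_h ** 2 - 2.0 * cov_eh
    if denom <= 0.0:
        # Degenerate case: fall back to the experimental-only estimator.
        return 1.0
    w = (var_h + bias_h ** 2 - cov_eh) / denom
    # Clipping to [0, 1] is an illustrative choice, not from the paper.
    return min(1.0, max(0.0, w))


def pessimistic_weight(var_e, var_h, cov_eh, bias_hat, bias_se, z=1.96):
    """Hypothetical pessimistic variant: plug in an upper bound on |bias|.

    The bound |bias_hat| + z*bias_se is an assumed form of the uncertainty
    quantifier; it shifts weight toward the experimental data when the
    bias estimate is noisy.
    """
    bias_bound = abs(bias_hat) + z * bias_se
    return optimal_weight(var_e, var_h, cov_eh, bias_bound)


def combine(tau_e, tau_h, w):
    """Linearly combine the two base estimators with weight w on tau_e."""
    return w * tau_e + (1.0 - w) * tau_h
```

With equal variances, zero covariance, and no bias, the sketch weights the two estimators equally; as the assumed bias of the historical estimator grows, the weight moves toward the experimental-only estimator, which mirrors the robustness motivation described in the abstract.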

Paper Structure

This paper contains 15 sections, 11 theorems, 57 equations, 11 figures, 1 table, and 1 algorithm.

Introduction
Related Work
Estimators in Non-dynamic Setting
Extension to Sequential Decision Making
Theoretical Properties
Properties of the Non-pessimistic Estimator
Robustness of the Pessimistic Estimator
Experiments
Additional Experiment Results
A hybrid procedure
Extension to Sequential Decision Making
Implementation Details
Proofs of the Theorems in Section \ref{sec:theorysampleestimator}
Notations and Auxiliary Lemmas
Auxiliary Lemmas

Key Result

Theorem 1

Under Assumptions \ref{con:coverage}--\ref{con:double}, the excess MSE of the non-pessimistic estimator compared to $\widehat{\tau}_{w^*}$, i.e., $\textrm{MSE}(\widehat{\tau}_{\widehat{w}})-\textrm{MSE}(\widehat{\tau}_{w^*})$, can be upper bounded by

Figures (11)

Figure 1: Distributions of estimated costs for optimal and sub-optimal arms. A key challenge arises when the estimated cost of a sub-optimal arm is inaccurately high, leading to failure of the greedy action selection method. To address this issue, we apply the pessimistic principle which takes into account the uncertainties inherent in these estimations. The estimates of the cost under the two arms are given by $\widehat{Cost}_k~(k=1,2)$ with their pessimistic versions $\widehat{Cost}_{k,U} ~(k=1,2)$. By comparing the upper bounds of the estimated costs, we effectively identify the optimal arm.
Figure 2: Boxplots of the SEE and the oracle MSE under the setting of Example \ref{ex:single_stage} when the bias $b_h=0$, and $d$ indicates the difference of the conditional variance of the reward between the experimental data and historical data.
Figure 3: Empirical means of MSEs for different methods under the switchback design in Example \ref{ex:single_stage}. The top panel displays all the methods, whereas the bottom panel focuses on the area excluding the SPE method.
Figure 4: Empirical means of MSEs for different methods in Example \ref{ex:real_data_based_agnostic}. The treatment effect ratios are $5\%$ (top) and $10\%$ (bottom).
Figure A1: Visual representations of scaled states and rewards in one city across 40 days, comprising drivers' total income, the number of requests, and drivers' total online time. Each line represents data from a specific day.
...and 6 more figures

Theorems & Definitions (33)

Remark 1
Definition 1: Experimental-data-only Estimator
Remark 2
Definition 2: Historical-data-based Estimator
Remark 3
Definition 3: Weighted Estimator
Remark 4
Remark 5
Remark 6
Theorem 1: MSE of the non-pessimistic estimator
...and 23 more
