Combining Incomplete Observational and Randomized Data for Heterogeneous Treatment Effects

Dong Yao, Caizhi Tang, Qing Cui, Longfei Li

TL;DR

CIO is a resilient approach to Combine Incomplete Observational data and randomized data for HTE estimation. It estimates HTEs efficiently whether the observational data are complete or incomplete.

Abstract

Data from observational studies (OSs) are widely available and easy to obtain, yet they frequently contain confounding biases. Data from randomized controlled trials (RCTs), by contrast, help to reduce these biases but are expensive to gather, so randomized datasets tend to be small. For this reason, effectively fusing observational and randomized data to better estimate heterogeneous treatment effects (HTEs) has gained increasing attention. However, existing methods for integrating observational data with randomized data require complete observational data, meaning that both treated and untreated subjects must be included in the OS. This prerequisite restricts such methods to very specific situations, because including both treated and untreated subjects in an observational study is not always achievable. In this paper, we propose a resilient approach to Combine Incomplete Observational data and randomized data for HTE estimation, which we abbreviate as CIO. CIO estimates HTEs efficiently whether the observational data are complete or incomplete. Concretely, a confounding bias function is first derived via an effect estimation procedure, using the pseudo-experimental group from the OS together with the pseudo-control group from the RCT. This function is then used as a corrective residual to rectify the observed outcomes of the observational data during HTE estimation, which combines the available observational data with all of the randomized data. To validate our approach, we conducted experiments on one synthetic dataset and two semi-synthetic datasets.
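
The two-stage procedure described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the OS is taken to contain treated units only, the "effect estimation procedure" is stood in for by a simple contrast of two linear regressions (OS treated versus RCT treated), and the names `fit_linear`, `cio_sketch`, and the simulated data are hypothetical. Because all models are linear, the pooled fit in the last stage coincides with the RCT-only fit, so the sketch shows only the order of the steps, not any efficiency gain.

```python
import numpy as np

def fit_linear(x, y):
    """Ordinary least squares on [1, x]; returns a prediction function."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda z: coef[0] + coef[1] * z

def cio_sketch(x_rct, t_rct, y_rct, x_os, y_os):
    """Toy two-stage estimator; assumes the OS holds treated units only."""
    treated = t_rct == 1
    # Stage 1: confounding bias as the outcome gap between OS treated units
    # (pseudo-experimental group) and RCT treated units (pseudo-control group).
    m_os = fit_linear(x_os, y_os)
    m_rct = fit_linear(x_rct[treated], y_rct[treated])
    bias_hat = lambda z: m_os(z) - m_rct(z)
    # Stage 2: subtract the estimated bias from the OS outcomes, then pool
    # the corrected OS data with all of the RCT data.
    y_os_corrected = y_os - bias_hat(x_os)
    mu1 = fit_linear(np.concatenate([x_rct[treated], x_os]),
                     np.concatenate([y_rct[treated], y_os_corrected]))
    mu0 = fit_linear(x_rct[~treated], y_rct[~treated])
    return lambda z: mu1(z) - mu0(z)

# Hypothetical simulated data: true effect tau(x) = 1 + 2x.
rng = np.random.default_rng(0)
n_rct, n_os = 200, 2000
x_rct = rng.uniform(0.0, 1.0, n_rct)
t_rct = rng.integers(0, 2, n_rct)
y_rct = x_rct + t_rct * (1 + 2 * x_rct) + rng.normal(0.0, 0.1, n_rct)
x_os = rng.uniform(0.0, 1.0, n_os)
# OS outcomes carry an extra confounding bias of 0.5 + x.
y_os = x_os + (1 + 2 * x_os) + (0.5 + x_os) + rng.normal(0.0, 0.1, n_os)

grid = np.linspace(0.0, 1.0, 101)
tau_true = 1 + 2 * grid

tau_cio = cio_sketch(x_rct, t_rct, y_rct, x_os, y_os)
rmse_cio = float(np.sqrt(np.mean((tau_cio(grid) - tau_true) ** 2)))

# Naive baseline: pool the uncorrected OS outcomes with the RCT treated units.
is_treated = t_rct == 1
mu1_naive = fit_linear(np.concatenate([x_rct[is_treated], x_os]),
                       np.concatenate([y_rct[is_treated], y_os]))
mu0_rct = fit_linear(x_rct[~is_treated], y_rct[~is_treated])
rmse_naive = float(np.sqrt(np.mean((mu1_naive(grid) - mu0_rct(grid) - tau_true) ** 2)))

print(f"corrected RMSE: {rmse_cio:.3f}, naive RMSE: {rmse_naive:.3f}")
```

In this toy setting, the naive pooled estimate absorbs most of the confounding bias, while the corrected estimate stays close to the true effect. With more flexible outcome models, the corrected OS data could also contribute information beyond the RCT alone.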

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The data composition in the two situations: complete and incomplete OS data. For illustration, the right subfigure shows a case where the control group is missing. In practice, the treatment group could be the absent one instead.
  • Figure 2: Comparison among data-fusion baselines under Ridge and RF with an increasing ratio of RCT data for training. Results on the Simulation, STAR, and NSW datasets are shown in panels (a), (b), and (c), respectively.
  • Figure 3: For all data-fusion techniques using Ridge and RF, we report $\sqrt{\epsilon_{PEHE}}$ across a range of $\beta$ values that control the strength of the confounding bias in the training OS data.
  • Figure 4: We vary the number of OS control samples used in training and evaluate the data-fusion techniques implemented with Ridge regression. The number of OS controls takes values in {1, 4, 16, 64, 256, 512}. Results for the Simulation dataset are shown in panel (a) and for the STAR dataset in panel (b).
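
The captions above report $\sqrt{\epsilon_{PEHE}}$ without defining it on this page. Assuming the paper uses the standard definition (precision in estimation of heterogeneous effects, i.e. the mean squared difference between estimated and true individual effects), the metric can be computed as in the minimal sketch below. The function name `sqrt_pehe` is hypothetical.

```python
import numpy as np

def sqrt_pehe(tau_true, tau_pred):
    """Square root of the mean squared error between true and estimated effects."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_pred = np.asarray(tau_pred, dtype=float)
    return float(np.sqrt(np.mean((tau_pred - tau_true) ** 2)))

# Example: one of three effects is off by 2, so the value is sqrt(4 / 3).
example_value = sqrt_pehe([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(round(example_value, 4))
```

Computing this metric requires the true effects, which is why it is typically reported on synthetic or semi-synthetic data where they are known by construction.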