Detecting critical treatment effect bias in small subgroups

Piersilvio De Bartolomeis; Javier Abad; Konstantin Donhauser; Fanny Yang

TL;DR

This work proposes a novel strategy to benchmark observational studies beyond the average treatment effect and estimates an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study.

Abstract

Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge.
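The idea of testing subgroup-level bias against a tolerance can be illustrated with a deliberately simplified sketch. It is not the authors' test. It assumes a small number of predefined discrete subgroups, uses a plain difference-in-means effect estimate within each subgroup in both studies (with no confounding adjustment), and applies a Bonferroni-corrected normal approximation. The function names `effect_and_var`, `bias_lower_bound`, `reject`, and `simulate`, as well as the simulated data, are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def effect_and_var(y, t):
    """Difference-in-means effect estimate and its estimated variance."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=bool)
    y1, y0 = y[t], y[~t]
    tau = y1.mean() - y0.mean()
    var = y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size
    return tau, var


def bias_lower_bound(rct, os_, alpha=0.05):
    """Lower confidence bound on the largest absolute subgroup bias.

    rct / os_: dicts mapping subgroup name -> (outcomes, treatments).
    Assumption: two-sided Bonferroni correction over the shared subgroups.
    """
    groups = sorted(set(rct) & set(os_))
    z = NormalDist().inv_cdf(1 - alpha / (2 * len(groups)))
    best = 0.0
    for g in groups:
        tau_rct, v_rct = effect_and_var(*rct[g])
        tau_os, v_os = effect_and_var(*os_[g])
        se = np.sqrt(v_rct + v_os)
        best = max(best, abs(tau_os - tau_rct) - z * se)
    return float(best)


def reject(rct, os_, delta, alpha=0.05):
    """Reject 'every subgroup bias is at most delta' if the bound exceeds delta."""
    return bool(bias_lower_bound(rct, os_, alpha) > delta)


def simulate(rng, n, effect):
    """Hypothetical data: random binary treatment, unit-variance noise."""
    t = rng.integers(0, 2, size=n).astype(bool)
    y = effect * t + rng.normal(size=n)
    return y, t


rng = np.random.default_rng(0)
# Subgroup "b" of the observational study carries an additive bias of 60.
rct = {"a": simulate(rng, 500, 10.0), "b": simulate(rng, 500, 10.0)}
os_ = {"a": simulate(rng, 2000, 10.0), "b": simulate(rng, 2000, 70.0)}

lower_bound = bias_lower_bound(rct, os_)
print(f"lower bound on max subgroup bias: {lower_bound:.2f}")
```

In this toy setup the test rejects tolerances well below the injected bias and does not reject tolerances above it. The sketch only works when the biased subgroups are known in advance.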

Paper Structure (58 sections, 2 theorems, 63 equations, 5 figures, 1 table)

Introduction
Problem setting
Treatment effect estimation
Null hypothesis
Discussion of our null hypothesis
Example 1: User-specified tolerance
Example 2: Sensitivity analysis bounds
Methodology
Null hypothesis using signal function
Oracle test statistic
A valid test statistic
Why not a classic U-statistic?
Theoretical guarantees
Discussion of assumptions
Power of the test
...and 43 more sections

Key Result

Theorem 3.1

Under the assumptions stated in the theorem, $\hat{\phi}(\alpha)$ is a valid asymptotic test at level $\alpha$ for the null hypothesis $H_0^{\mathcal{G}}$.
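For reference, asymptotic validity at level $\alpha$ is commonly formalized as below. This is the standard textbook definition and not necessarily the exact display used in the paper. It assumes that $\hat{\phi}(\alpha) = 1$ denotes rejection and that $n$ denotes the sample size:

$$\limsup_{n \to \infty} \; \mathbb{P}_{H_0^{\mathcal{G}}}\big(\hat{\phi}(\alpha) = 1\big) \le \alpha$$

In words, when the null hypothesis holds, the probability of a false rejection does not exceed $\alpha$ in the large-sample limit.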

Figures (5)

Figure 1: High-level illustration of our approach. We want to test if the bias in the observational study, i.e. $\mu^{\mathrm{os}} -\tau^{\mathrm{os}}$, is contained within a tolerance range. However, the true treatment effect $\mu^{\mathrm{os}}$ is not identifiable, and instead, we test the bias between the treatment effects estimated from the two studies, i.e. $\tau^{\mathrm{os}} -\tau^{\mathrm{rct}}$.
Figure 2: For all the plots: the significance level is set at $\alpha=0.05$, $\phi^\star$ denotes the oracle test, which rejects for $\delta<\delta^{\star}$. (a-b) Scenario 1, comprising a single subgroup with a constant bias $\delta^{\star}=60$: we plot the bias lower bound $\hat{\delta}_{\texttt{LB}}$ as a function of (a) the biased subgroup percentage w.r.t. total sample size and (b) the randomized trial sample size. (c-d) Probability of rejection for different function classes $\mathcal{G}$ as a function of the user-specified tolerance $\delta$ for (c) Scenario 2 (Figure 5a) based on 12 subgroups with different biases and (d) Scenario 3 (Figure 5b) based on a quadratic polynomial bias. We report mean and standard error over 5 runs. The coefficients for the polynomial bias are fixed across runs.
Figure 3: For all the plots: the significance level is set at $\alpha=0.05$, and the bias model is from Scenario 2. (a) Effect of varying the feature set $X^{\mathcal{J}}$ on the average lower bound $\hat{\delta}_{\texttt{LB}}$, illustrating the trade-off between feature set size and the power of the test. $\phi^\star$ represents the oracle test, which rejects for $\delta<\delta^{\star}$. The highest power is achieved when the feature set size is $|X^{\mathcal{J}}|=3$, which includes only the features relevant for modeling the bias. We average runs over 5 seeds and report the standard error. (b) Evolution of the test statistic with respect to the training epochs using the small neural network. We set the user tolerance to $\delta=58$, close to the maximum true bias $\delta^{\star}=60$. The dashed red line represents the $\alpha$-quantile of the absolute normal distribution. The remaining hyperparameters are the same as in the experimental setting.
Figure 4: Comparison between the estimated and true bias models for Scenario 2. Our estimates of the bias closely align with the true bias. We run the test with a random seed, using the same hyperparameters as in our experimental evaluation, and set the user tolerance to $\delta=57$.
Figure 5: Heatmap visualizations of the bias for (a) Scenario 2 based on 12 subgroups with different biases (the numbers in the cells represent the percentage w.r.t. the full observational dataset), and (b) Scenario 3 based on a quadratic polynomial bias.

Theorems & Definitions (2)

Theorem 3.1: Validity of the test
Theorem A.1
