Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation

Christopher Bockel-Rickermann, Toon Vanderschueren, Tim Verdonck, Wouter Verbeke

TL;DR

We address CADR estimation from observational data by introducing a unifying problem formulation and a five-component decomposition that disentangles non-linearity, intervention and dose confounding, and their distributional imbalances. The approach is applied to eight estimators across TCGA-2 and IHDP-derived datasets, revealing that non-uniform dose distributions and non-linear CADR surfaces largely drive performance, while confounding often plays a minor role on standard benchmarks. A new IHDP-3 dataset demonstrates that confounding becomes problematic only under high heterogeneity in CADR responses, highlighting the need for benchmarks that stress both heterogeneity and confounding. The work advocates a standardized, data-centric evaluation framework to guide method development and benchmarking in CADR estimation.

Abstract

Estimating conditional average dose responses (CADR) is an important but challenging problem. Estimators must correctly model the potentially complex relationships between covariates, interventions, doses, and outcomes. In recent years, the machine learning community has shown great interest in developing tailored CADR estimators that target specific challenges. Their performance is typically evaluated against other methods on (semi-)synthetic benchmark datasets. Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance. Established benchmarks entail multiple challenges, whose impacts must be disentangled. Therefore, we propose a novel decomposition scheme that allows the evaluation of the impact of five distinct components contributing to CADR estimator performance. We apply this scheme to eight popular CADR estimators on four widely used benchmark datasets, running nearly 1,500 individual experiments. Our results reveal that most established benchmarks are challenging for reasons different from their creators' claims. Notably, confounding, the key challenge tackled by most estimators, is not an issue in any of the considered datasets. We discuss the major implications of our findings and present directions for future research.

Paper Structure

This paper contains 19 sections, 2 equations, 9 figures, 16 tables, 1 algorithm.

Figures (9)

  • Figure 1: Selected components of our decomposition scheme. To disentangle the effects of confounding from the effects of non-uniform dose distributions, we evaluate estimators in three scenarios: 1) when doses are sampled at random from a uniform distribution, 2) when doses follow a non-uniform distribution that does not depend on the unit, and 3) when the data is confounded, i.e., when dose assignment depends on the unit. The distribution of doses across the total population is the same in scenarios 2) and 3). Our complete scheme includes two additional steps related to the distribution of interventions when there are multiple intervention options (cf. Section \ref{sec:decompose-dgp}).
  • Figure 2: SWIG illustrating causal dependencies in observational data
  • Figure 3: Dose distribution for different levels of confounding in data by Bica2020
  • Figure 4: Distribution of errors per intervention and dose interval on the test set of the TCGA-2 dataset, estimated by an MLP. A histogram of doses in the training set is added to each plot in blue. Errors are correlated with dose non-uniformity, supporting the claim that non-uniformity affects model performance.
  • Figure 5: MISE per method and dataset. Across datasets, confounding has little adverse effect on model performance. Full results, including standard errors, can be found in Appendix \ref{sec:A_results-per-ds}.
  • ...and 4 more figures
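
The three dose-assignment scenarios described in the caption of Figure 1 can be illustrated with a minimal sketch. This is not the authors' data-generating process: the single scalar covariate, the Beta parameterisation, the `strength` parameter, and all function names are assumptions made for illustration. Scenario 2 is obtained here by shuffling the confounded doses across units, which is one simple way to keep the population-level dose distribution identical to scenario 3 while removing its dependence on the unit.

```python
import random


def uniform_doses(n, rng):
    """Scenario 1: doses drawn uniformly at random from [0, 1]."""
    return [rng.random() for _ in range(n)]


def confounded_doses(covariates, rng, strength=5.0):
    """Scenario 3: each unit's dose depends on its covariate x in [0, 1].

    Hypothetical assignment rule: d | x ~ Beta(1 + s*x, 1 + s*(1 - x)),
    so units with larger x tend to receive larger doses.
    """
    return [
        rng.betavariate(1.0 + strength * x, 1.0 + strength * (1.0 - x))
        for x in covariates
    ]


def shuffled_doses(doses, rng):
    """Scenario 2: same marginal dose distribution, no link to the unit."""
    permuted = list(doses)
    rng.shuffle(permuted)
    return permuted


def pearson(xs, ys):
    """Plain Pearson correlation, used only to check the dependence."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(42)
covariates = [rng.random() for _ in range(2000)]  # one confounder per unit

scenario_3 = confounded_doses(covariates, rng)   # confounded
scenario_2 = shuffled_doses(scenario_3, rng)     # non-uniform, unit-independent
scenario_1 = uniform_doses(len(covariates), rng)  # uniform, randomised

print("corr(x, dose) confounded:", round(pearson(covariates, scenario_3), 3))
print("corr(x, dose) shuffled:  ", round(pearson(covariates, scenario_2), 3))
```

Under these assumptions, comparing an estimator trained on scenario 1 against scenario 2 isolates the effect of dose non-uniformity, and comparing scenario 2 against scenario 3 isolates the effect of confounding, since the two share the same multiset of doses.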