SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

Shahriar Noroozizadeh, Xiaobin Shen, Jeremy C. Weiss, George H. Chen

TL;DR

SurvHTE-Bench provides the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations, establishing a foundation for fair, reproducible, and extensible evaluation of causal survival methods.

Abstract

Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE-Bench

Paper Structure

This paper contains 75 sections, 35 equations, 24 figures, and 35 tables.

Figures (24)

  • Figure 1: (top) Borda count rankings of the top 10 estimator variants (out of 53 total), based on CATE RMSE across 40 datasets and averaged over 10 repeats (lower is better). (bottom) Family-level rankings, where for each dataset the best method variant within each method family is chosen using validation performance and then ranked on the held-out test set. Black bands connect methods without statistically significant differences (Wilcoxon signed-rank test, FDR-corrected at $\alpha=0.05$). Shaded regions indicate the standard error of the rank across datasets.
  • Figure 2: CATE RMSE in Scenario C across 10 experimental repeats.
  • Figure 3: CATE RMSE for twin birth data with $h=30$ days across 10 experimental runs.
  • Figure 4: CATE estimation comparison between baseline and high-censoring conditions under ZDV vs. ZDV+ddI treatments. Each point represents an individual patient, with the dashed diagonal line indicating perfect consistency between the baseline CATE estimate and the estimate obtained after injecting additional censoring.
  • Figure 5: (Synthetic datasets) Kaplan-Meier curves across causal configurations (rows) and survival scenarios (columns). Solid lines show event-time survival under control (blue) and treatment (orange); dotted lines show censoring-time survival for each arm. Each panel reports the empirical censoring rate $c$ and treatment probability $p$.
  • ...and 19 more figures
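
The Figure 1 caption describes ranking estimators by CATE RMSE across datasets with a Borda count. As a rough illustration of that idea, the sketch below ranks methods within each dataset (rank 1 is the lowest RMSE, ties share the average rank) and then averages the ranks across datasets. The function name `average_ranks`, the method names, and the RMSE values are all hypothetical, and this excerpt does not specify the exact Borda variant or tie-handling the authors use, so treat this as one plausible reading rather than the benchmark's implementation.

```python
from statistics import mean


def average_ranks(rmse_by_method):
    """Return the mean rank of each method across datasets.

    rmse_by_method maps a method name to a list of RMSE values, one per
    dataset, in the same dataset order for every method. Rank 1 is the
    lowest RMSE on a dataset; tied methods share the average of the
    ranks they occupy.
    """
    methods = list(rmse_by_method)
    n_datasets = len(next(iter(rmse_by_method.values())))
    ranks = {m: [] for m in methods}
    for d in range(n_datasets):
        scores = {m: rmse_by_method[m][d] for m in methods}
        for m in methods:
            better = sum(1 for o in methods if scores[o] < scores[m])
            tied = sum(1 for o in methods if scores[o] == scores[m])  # counts m itself
            # Tied methods occupy ranks better+1 .. better+tied; use their average.
            ranks[m].append(better + (tied + 1) / 2)
    return {m: mean(r) for m, r in ranks.items()}


# Hypothetical RMSE values for three made-up methods on three datasets.
toy_rmse = {
    "method_a": [0.10, 0.30, 0.20],
    "method_b": [0.20, 0.10, 0.20],
    "method_c": [0.30, 0.20, 0.40],
}
mean_ranks = average_ranks(toy_rmse)
ordering = sorted(mean_ranks, key=mean_ranks.get)  # best (lowest mean rank) first
```

With these toy numbers, `method_b` ends up with the lowest mean rank. The significance bands in Figure 1 (Wilcoxon signed-rank test with FDR correction) are a separate step that this sketch does not attempt.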