Evaluation of machine-learning models to measure individualized treatment effects from randomized clinical trial data with time-to-event outcomes
Elvire Roblin, Paul-Henry Cournède, Stefan Michiels
TL;DR
The paper addresses how to estimate individualized treatment effects in randomized trials with time-to-event outcomes when high-dimensional data and nonlinear biomarker interactions are present. It compares neural-network–based survival models (CoxCC, CoxTime) and Interaction Forests against a Cox proportional hazards (CoxPH) baseline with an adaptive LASSO penalty, using conditional average treatment effect (CATE) metrics adapted to handle censoring. Across two simulation designs and three cancer datasets, the methods show complementary strengths: Interaction Forests excel at discriminating treatment heterogeneity (C-for-benefit), neural-network–based models provide well-calibrated benefit estimates, and adaptive LASSO remains competitive on RMSE. The work demonstrates the potential of causal machine learning for precision medicine in survival settings and offers a practical framework for evaluating individualized treatment effects with time-to-event outcomes.
Abstract
Objective: In randomized clinical trials, prediction models can be used to explore the relationships between patients' variables (e.g., clinical, pathological, or lifestyle variables, as well as biomarker or genomic data) and the magnitude of the treatment effect. Our aim was to evaluate flexible machine learning models capable of incorporating interactions and nonlinear effects from high-dimensional data to derive individualized treatment recommendations in trials with time-to-event outcomes.
Methods: We compared survival models based on neural networks (CoxCC and CoxTime) and random survival forests (Interaction Forests, IF) against a Cox proportional hazards model with an adaptive LASSO (ALASSO) penalty as a benchmark. For individualized treatment recommendations in the survival setting, we adapted metrics originally designed for binary outcomes to accommodate time-to-event data with censoring. These adapted metrics were the C-for-benefit, the E50-for-benefit, and the root mean squared error (RMSE) for treatment benefit. An extensive simulation study was conducted using two different data generation processes incorporating nonlinearity and interactions. The models were then applied to gene expression and clinical data from three cancer clinical trial data sets.
Results: In the first data generation process, neural networks outperformed ALASSO in terms of calibration, while IF showed superior C-for-benefit performance. In the second data generation process, both machine learning approaches (neural networks and IF) outperformed the linear ALASSO benchmark across discrimination, calibration, and RMSE metrics. In the cancer trial data sets, the machine learning methods often performed better than ALASSO: IF in particular for C-for-benefit, and either a neural network or IF for the calibration measures addressing treatment benefit.
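To make the discrimination metric concrete, the following is a minimal sketch of the C-for-benefit in its original binary-outcome form, which the abstract names as the starting point for the adapted metrics. It is not the censoring-adapted version evaluated in the paper, which the abstract does not specify. The function name `c_for_benefit_binary`, the rank-based 1:1 matching of treated and control patients, and the truncation of the larger arm are illustrative assumptions.

```python
import numpy as np


def c_for_benefit_binary(pred_benefit, treated, outcome):
    """Binary-outcome C-for-benefit via rank matching (illustrative sketch).

    pred_benefit : predicted absolute risk reduction (higher = more benefit)
    treated      : 1 if the patient was treated, 0 if control
    outcome      : 1 if the adverse event occurred, 0 otherwise
    """
    pred_benefit = np.asarray(pred_benefit, dtype=float)
    treated = np.asarray(treated, dtype=int)
    outcome = np.asarray(outcome, dtype=int)

    t_idx = np.flatnonzero(treated == 1)
    c_idx = np.flatnonzero(treated == 0)
    n_pairs = min(len(t_idx), len(c_idx))

    # Assumption: sort each arm by predicted benefit and match by rank,
    # dropping the surplus patients of the larger arm.
    t_sorted = t_idx[np.argsort(pred_benefit[t_idx], kind="stable")][:n_pairs]
    c_sorted = c_idx[np.argsort(pred_benefit[c_idx], kind="stable")][:n_pairs]

    # Predicted pair benefit: mean of the two matched predictions.
    pair_pred = (pred_benefit[t_sorted] + pred_benefit[c_sorted]) / 2.0
    # Observed pair benefit in {-1, 0, 1}: control outcome minus treated outcome.
    pair_obs = outcome[c_sorted] - outcome[t_sorted]

    concordant = 0.0
    comparable = 0
    for i in range(n_pairs):
        for j in range(i + 1, n_pairs):
            if pair_obs[i] == pair_obs[j]:
                continue  # pairs with equal observed benefit are not comparable
            comparable += 1
            hi, lo = (i, j) if pair_obs[i] > pair_obs[j] else (j, i)
            if pair_pred[hi] > pair_pred[lo]:
                concordant += 1.0
            elif pair_pred[hi] == pair_pred[lo]:
                concordant += 0.5  # ties in prediction count as half
    return concordant / comparable if comparable else float("nan")
```

A value near 0.5 indicates that predicted benefit does not order patient pairs better than chance, while values closer to 1 indicate better discrimination of treatment benefit. With time-to-event data the observed pair benefit is not always known because of censoring, which is why an adaptation is needed there.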
