Bootstrapping-based Regularisation for Reducing Individual Prediction Instability in Clinical Risk Prediction Models

Sara Matijevic, Christopher Yau

TL;DR

This work tackles the problem of unstable individual predictions in clinical risk models by embedding a bootstrap-based regularisation term directly into the training objective of a deep neural network. The approach penalises deviations between the model's predictions and those of models trained on bootstrapped resamples, yielding a single model with ensemble-like robustness. Across simulated data and three real clinical datasets (GUSTO-I, Framingham, SUPPORT), the stable model lowers the mean absolute difference (MAD) and yields markedly fewer predictions that deviate significantly from the bootstrap medians, while maintaining AUC and SHAP-based interpretability. Ensemble methods offer greater stability but at the cost of interpretability, which makes the proposed method a practical and clinically trustworthy route to robust, transparent deep learning models in data-limited healthcare settings.

Abstract

Clinical prediction models are increasingly used to support patient care, yet many deep learning-based approaches remain unstable, as their predictions can vary substantially when trained on different samples from the same population. Such instability undermines reliability and limits clinical adoption. In this study, we propose a novel bootstrapping-based regularisation framework that embeds the bootstrapping process directly into the training of deep neural networks. This approach constrains prediction variability across resampled datasets, producing a single model with inherent stability properties. We evaluated models constructed using the proposed regularisation approach against conventional and ensemble models using simulated data and three clinical datasets: GUSTO-I, Framingham, and SUPPORT. Across all datasets, our model exhibited improved prediction stability, with lower mean absolute differences (e.g., 0.019 vs. 0.059 in GUSTO-I; 0.057 vs. 0.088 in Framingham) and markedly fewer significantly deviating predictions. Importantly, discriminative performance and feature importance consistency were maintained, with high SHAP correlations between models (e.g., 0.894 for GUSTO-I; 0.965 for Framingham). While ensemble models achieved greater stability, we show that this came at the expense of interpretability, as each constituent model used predictors in different ways. By regularising predictions to align with bootstrapped distributions, our approach allows prediction models to be developed that achieve greater robustness and reproducibility without sacrificing interpretability. This method provides a practical route toward more reliable and clinically trustworthy deep learning models, particularly valuable in data-limited healthcare settings.
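
The idea of regularising predictions towards those of bootstrapped models can be illustrated with a minimal NumPy sketch. The exact form of the penalty is not given in this summary, so the version below is an assumption: a binary cross-entropy term plus `lam` times the mean absolute deviation between the main model's predictions and those of `B` bootstrap models. The function names (`bce`, `stability_regularised_loss`) and the toy arrays are hypothetical and only serve to show the shape of such an objective.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Mean binary cross-entropy between labels y and probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def stability_regularised_loss(y, p_main, p_boot, lam=1.0):
    """Fit term plus a penalty on deviation from bootstrap predictions.

    y      : (n,)   binary outcomes
    p_main : (n,)   predicted risks from the model being trained
    p_boot : (B, n) predicted risks from B bootstrap-trained models
    lam    : weight of the stability penalty (assumed hyperparameter)
    """
    fit = bce(y, p_main)
    # Assumed penalty: mean absolute deviation from each bootstrap prediction.
    penalty = float(np.mean(np.abs(p_boot - p_main[None, :])))
    return fit + lam * penalty

# Toy example with two individuals and two bootstrap models.
y = np.array([0.0, 1.0])
p_main = np.array([0.2, 0.8])
p_boot_same = np.tile(p_main, (2, 1))   # bootstrap models agree exactly
p_boot_shift = p_boot_same + 0.1        # bootstrap models disagree by 0.1

loss_same = stability_regularised_loss(y, p_main, p_boot_same, lam=2.0)
loss_shift = stability_regularised_loss(y, p_main, p_boot_shift, lam=2.0)
```

When the bootstrap models agree with the main model the penalty vanishes and the loss reduces to the cross-entropy; any disagreement raises the loss in proportion to `lam`. In an actual training loop the same expression would be written in a differentiable framework so that gradients flow through `p_main`.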

Paper Structure (22 sections, 5 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Approach Overview. Prediction models trained on different development data sets $\mathcal{D}/\mathcal{D}'$, even drawn from the same population $\mathcal{P}$, can lead to models ($f_\theta/f_{\theta'}$) which produce different risk probabilities at the individual level.
  • Figure 2: Prediction model stability. Comparison of (A) prediction stability (MAD), (B) predictive performance (AUC), and (C) the proportion of significantly deviating predictions for the standard (violet), stable (orange), and ensemble (gray) models across the simulated, GUSTO-I, Framingham, and SUPPORT datasets.
  • Figure 3: Individual prediction stability. Individual-level predictions for selected participants from (A) the simulated, (B) GUSTO-I, (C) Framingham and (D) SUPPORT datasets. The violin plots display the distribution of predictions from 200 bootstrapped models, while the dots indicate the standard (violet) and stable (orange) predictions.
  • Figure 4: Prediction deviation. Histograms of p-values for the significance of individual-level prediction deviations from the bootstrapped median for stable (orange) and standard (violet) models for the (A) simulated, (B) GUSTO-I, (C) Framingham and (D) SUPPORT datasets. The dashed line shows a 0.05 significance threshold.
  • Figure 5: Hyperparameter Sensitivity. Impact on AUC and MAD of changes in (A) $\lambda$ and (B) the number of bootstrapped models in the stable model for GUSTO-I, Framingham and SUPPORT. The diamond and the dot represent the ensemble and standard models, respectively.
  • ...and 2 more figures
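
The stability metrics named in the captions above (MAD, and the p-values for deviation from the bootstrapped median) can be sketched as follows. The paper's precise definitions are not reproduced in this summary, so both functions are assumptions: MAD is taken as the mean absolute difference between a model's predictions and those of the bootstrap models, and the p-value is an empirical two-sided one, namely the fraction of bootstrap predictions lying at least as far from the bootstrap median as the model's own prediction. The function names and toy data are hypothetical.

```python
import numpy as np

def mean_absolute_difference(p_model, p_boot):
    """Assumed MAD: average |bootstrap prediction - model prediction|.

    p_model : (n,)   predictions from the model under evaluation
    p_boot  : (B, n) predictions from B bootstrap-trained models
    """
    return float(np.mean(np.abs(p_boot - p_model[None, :])))

def deviation_p_values(p_model, p_boot):
    """Assumed empirical p-value per individual.

    Fraction of bootstrap predictions that deviate from the bootstrap
    median at least as much as the model's prediction does.
    """
    med = np.median(p_boot, axis=0)
    boot_dev = np.abs(p_boot - med[None, :])
    model_dev = np.abs(p_model - med)
    return np.mean(boot_dev >= model_dev[None, :], axis=0)

# Toy data: one individual, five bootstrap predictions.
p_boot = np.array([[0.1], [0.2], [0.3], [0.4], [0.5]])

central = np.array([0.3])   # sits on the bootstrap median
extreme = np.array([0.9])   # far outside the bootstrap distribution

mad_central = mean_absolute_difference(central, p_boot)
p_central = deviation_p_values(central, p_boot)
p_extreme = deviation_p_values(extreme, p_boot)

# Share of individuals flagged at the 0.05 threshold used in Figure 4.
prop_deviating = float(np.mean(p_extreme < 0.05))
```

A prediction on the bootstrap median receives a p-value of 1, while one outside the range of all bootstrap predictions receives 0 and would be counted as significantly deviating at the 0.05 threshold.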