Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare

Elizabeth W. Miller, Jeffrey D. Blume

TL;DR

An evaluation framework is proposed that quantifies individual-level prediction instability by using two complementary diagnostics: empirical prediction interval width (ePIW), which captures variability in continuous risk estimates, and empirical decision flip rate (eDFR), which measures instability in threshold-based clinical decisions.

Abstract

In healthcare, predictive models increasingly inform patient-level decisions, yet little attention is paid to the variability in individual risk estimates and its impact on treatment decisions. For overparameterized models, now standard in machine learning, a substantial source of variability often goes undetected. Even when the data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient. This problem is largely obscured by standard evaluation practices, which rely on aggregate performance metrics (e.g., log-loss, accuracy) that are agnostic to individual-level stability. As a result, models with indistinguishable aggregate performance can nonetheless exhibit substantial procedural arbitrariness, which can undermine clinical trust. We propose an evaluation framework that quantifies individual-level prediction instability by using two complementary diagnostics: empirical prediction interval width (ePIW), which captures variability in continuous risk estimates, and empirical decision flip rate (eDFR), which measures instability in threshold-based clinical decisions. We apply these diagnostics to simulated data and the GUSTO-I clinical dataset. Across observed settings, we find that for flexible machine-learning models, randomness arising solely from optimization and initialization can induce individual-level variability comparable to that produced by resampling the entire training dataset. Neural networks exhibit substantially greater instability in individual risk predictions than logistic regression models. Risk estimate instability near clinically relevant decision thresholds can alter treatment recommendations. These findings suggest that stability diagnostics should be incorporated into routine model validation for assessing clinical reliability.
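To make the two diagnostics concrete, the following is a minimal sketch of how they might be computed for a single individual from the predicted risks produced by repeated refits. The formal definitions are not given in this summary, so the sketch rests on stated assumptions: ePIW is taken to be the width of the central 95% empirical interval of the predictions, eDFR is taken to be the share of refits whose thresholded decision disagrees with the majority decision, and a patient is flagged for treatment when predicted risk is at or above the threshold. The function names `quantile`, `epiw`, and `edfr` are illustrative only.

```python
from typing import Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty sequence, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def epiw(preds: Sequence[float], level: float = 0.95) -> float:
    """Width of the central `level` empirical interval of one individual's predictions.

    Assumption: a percentile-based interval; the paper may define ePIW differently.
    """
    alpha = (1.0 - level) / 2.0
    return quantile(preds, 1.0 - alpha) - quantile(preds, alpha)


def edfr(preds: Sequence[float], threshold: float) -> float:
    """Share of refits whose decision disagrees with the majority decision.

    Assumption: "treat" means predicted risk >= threshold, and flips are
    counted against the majority decision rather than a reference model.
    """
    treat = sum(p >= threshold for p in preds)
    return min(treat, len(preds) - treat) / len(preds)


# Hypothetical predictions for one patient from five refits of the same pipeline.
patient_preds = [0.05, 0.08, 0.10, 0.12, 0.15]
width = epiw(patient_preds)
flip_rate = edfr(patient_preds, threshold=0.10)
```

For these five hypothetical predictions and a 10% threshold, the interval width is approximately 0.094 and the flip rate is 0.4: three refits recommend treatment and two do not. Under this definition, predictions that all fall on the same side of the threshold give a flip rate of zero however widely they vary, which is why the two diagnostics are complementary.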

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Individual-level prediction instability in simulation (top) and the GUSTO-I clinical dataset (bottom). Panels (A--D) report empirical prediction interval width (ePIW); panels (E--H) report empirical decision flip rate (eDFR). Columns correspond to resampled training data versus fixed training data (random-seed variation only), each evaluated at $n_{\text{train}} \in \{500, 5000\}$. In the simulated setting, instability is plotted as a function of true risk; in GUSTO-I, instability is plotted as a function of developed model risk. Across both settings, prediction dispersion is consistently larger for neural networks than for logistic regression and attenuates with increasing training sample size under resampled training data. Decision instability concentrates near clinical decision thresholds and diminishes with increased data and reduced sources of randomness.
  • Figure 2: Individual-level prediction distributions across model families. Each panel displays the empirical distribution of predicted risks ($\widehat{P}_i$) for 12 representative individuals ($x_1$ through $x_{12}$) across $B$ repeated instantiations of the learning pipeline, each fitted to resampled training data. From left to right, the panels represent: (1) Log-LBFGS, (2) Log-poly, (3) Log-SGD, (4) NN-1L, and (5) NN-2L. While all models satisfy the competitiveness criterion, the ridge plots reveal varying degrees of individual-level dispersion, particularly as model complexity increases from logistic regression to neural network architectures.
  • Figure 3: Prediction instability for a fixed out-of-sample individual (true risk $=0.381$) under $B=100$ learning pipeline (LP) instances with $n_{\text{train}}=500$ from simulation data. Each point denotes the predicted risk from one fitted model instance. Left: variability induced by resampling the training data. Right: variability induced by random seed initialization with fixed training data.
  • Figure 4: Predicted risk versus true risk across model classes, training sample sizes, and sources of stochasticity. Each row corresponds to a model specification (Log-LBFGS, Log-SGD, Log-poly, NN-1L, NN-2L). Columns report results under resampled training data and fixed training data (random-seed variation only), each evaluated at $n_{\text{train}} \in \{500, 5000\}$. Points represent individual-level predicted risks evaluated on a common test set across $B=100$ model retrainings. The dashed $45^\circ$ line indicates perfect calibration ($\widehat{p}=p_{\text{true}}$).
  • Figure 5: Individual-level prediction instability, bias, and mean squared error as functions of true risk in the simulated setting. Columns correspond to resampled training data and fixed training data, each evaluated at $n_{\text{train}} \in \{500, 5000\}$. Rows report empirical prediction interval width (ePIW; top), prediction bias (middle), and mean squared error (MSE; bottom), with LOESS smoothing applied to highlight systematic trends across the true-risk spectrum. Curves correspond to different model classes.
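
The two sources of stochasticity compared in Figures 3 through 5 (resampled training data versus fixed training data with random-seed variation only) can be illustrated with a small simulation. The sketch below is not the authors' pipeline. It assumes a hypothetical one-covariate logistic truth (`TRUE_BETA`), uses a plain SGD-fitted logistic regression as the only model, and picks arbitrary values for the learning rate and epoch count. All function names are illustrative. It returns the predicted risks for one fixed individual across $B$ pipeline instances, then summarizes them by bias and mean squared error against the true risk, as in Figure 5.

```python
import math
import random
from typing import List, Sequence, Tuple

TRUE_BETA = (-0.5, 1.0)  # hypothetical intercept and slope of the simulated truth


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def true_risk(x: float) -> float:
    return sigmoid(TRUE_BETA[0] + TRUE_BETA[1] * x)


def draw_data(n: int, rng: random.Random) -> List[Tuple[float, int]]:
    """Simulate n (covariate, outcome) pairs from the assumed logistic truth."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < true_risk(x) else 0
        data.append((x, y))
    return data


def fit_sgd(data: Sequence[Tuple[float, int]], seed: int,
            epochs: int = 5, lr: float = 0.05) -> Tuple[float, float]:
    """Logistic regression by SGD; the seed drives initialization and shuffling."""
    rng = random.Random(seed)
    b0, b1 = rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            err = sigmoid(b0 + b1 * x) - y  # gradient of log-loss w.r.t. the logit
            b0 -= lr * err
            b1 -= lr * err * x
    return b0, b1


def repeated_predictions(x_new: float, n_train: int, B: int,
                         resample: bool) -> List[float]:
    """Predicted risk for one individual across B pipeline instances.

    resample=True draws a fresh training set for every instance;
    resample=False reuses one training set so only the seed varies.
    """
    fixed = draw_data(n_train, random.Random(0))
    preds = []
    for b in range(B):
        data = draw_data(n_train, random.Random(1000 + b)) if resample else fixed
        b0, b1 = fit_sgd(data, seed=b)
        preds.append(sigmoid(b0 + b1 * x_new))
    return preds


def bias_and_mse(preds: Sequence[float], truth: float) -> Tuple[float, float]:
    """Bias and mean squared error of repeated predictions against the true risk."""
    bias = sum(preds) / len(preds) - truth
    mse = sum((p - truth) ** 2 for p in preds) / len(preds)
    return bias, mse
```

Calling `repeated_predictions(0.0, n_train=500, B=100, resample=False)` and again with `resample=True` yields two sets of 100 predictions for the same individual, analogous to the right and left panels of Figure 3. Any spread in the first set comes from the seed alone, since the training data never change. Passing either set to `bias_and_mse` with `true_risk(0.0)` gives the per-individual quantities plotted in the lower rows of Figure 5.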