On Arbitrary Predictions from Equally Valid Models

Sarah Lockfisch, Kristian Schwethelm, Martin Menten, Rickmer Braren, Daniel Rueckert, Alexander Ziller, Georgios Kaissis

Abstract

Model multiplicity refers to the existence of multiple machine learning models that describe the data equally well but may produce different predictions on individual samples. In medicine, these models can yield conflicting predictions for the same patient, a risk that is poorly understood and insufficiently addressed. In this study, we empirically analyze the extent, drivers, and ramifications of predictive multiplicity across diverse medical tasks and model architectures, and show that even small ensembles can mitigate or even eliminate predictive multiplicity in practice. Our analysis reveals that (1) standard validation metrics fail to identify a uniquely optimal model and (2) a substantial number of predictions hinge on arbitrary choices made during model development. Using multiple models instead of a single model reveals instances where predictions differ across equally plausible models, highlighting patients who would receive arbitrary diagnoses if any single model were used. In contrast, (3) a small ensemble paired with an abstention strategy can effectively mitigate measurable predictive multiplicity in practice; predictions with high inter-model consensus may thus be amenable to automated classification. While accuracy is not a principled antidote to predictive multiplicity, we find that (4) higher accuracy achieved through increased model capacity reduces predictive multiplicity. Our findings underscore the clinical importance of accounting for model multiplicity and advocate for ensemble-based strategies to improve diagnostic reliability. In cases where models fail to reach sufficient consensus, we recommend deferring decisions to expert review.
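
The ensemble-plus-abstention idea in point (3) can be illustrated with a minimal sketch. The abstract does not spell out the exact consensus rule, so everything here is an assumption made for illustration: the function name `ensemble_with_abstention`, the `min_agreement` threshold, and the use of a plain majority vote over hard class labels.

```python
import numpy as np

def ensemble_with_abstention(predictions, min_agreement=1.0):
    """Majority vote over models, abstaining when consensus is too low.

    predictions: integer array of shape (n_models, n_samples) holding the
        class label each model assigns to each sample.
    min_agreement: hypothetical threshold; a sample is only covered if at
        least this fraction of models votes for the majority label.
    Returns (labels, covered): the majority label per sample and a boolean
        mask that is False where the decision should be deferred.
    """
    predictions = np.asarray(predictions)
    n_models = predictions.shape[0]
    n_classes = int(predictions.max()) + 1
    # votes[c, i] = number of models that predict class c for sample i
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    labels = votes.argmax(axis=0)            # ties resolve to the lowest class index
    consensus = votes.max(axis=0) / n_models  # share of models backing the majority
    covered = consensus >= min_agreement
    return labels, covered

# Toy example: 3 models, 4 patients (labels are made up).
preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 0, 1],
    [0, 2, 1, 1],
])
labels, covered = ensemble_with_abstention(preds, min_agreement=1.0)
for i, (label, ok) in enumerate(zip(labels, covered)):
    print(f"patient {i}:", f"predict class {label}" if ok else "defer to expert review")
```

With `min_agreement=1.0` only unanimous cases are classified automatically; lowering the threshold trades coverage against the risk of passing on a prediction that another equally valid model would contradict.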

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Model multiplicity in clinical decision-making. Multiple machine learning models achieve similar average performance and are thus equally valid choices to evaluate a patient's data (model 1 to n). However, these models may produce different risk predictions (low, medium, high) for the same patient, leading to divergent treatment recommendations (monitor, hospital admission, ICU transfer). As a result, a patient's care pathway can vary substantially depending on which model is used, ultimately resulting in arbitrary treatment. Rather than relying on a single model, we can adopt an ensemble-based strategy: cases with high inter-model agreement may be amenable to automated prediction, while those with low agreement should be reviewed by experts.
  • Figure 2: Variation in accuracy within the empirical Rashomon set. Each plot corresponds to a dataset/architecture combination with points ($\bullet$) representing models trained from different random initializations. Models within the green region ($\bullet$) cannot be distinguished at the 95% confidence level. Notably, only two models from the Abdominal CT/ConvNeXt combination show a statistically significant performance difference, an effect that would likely remain undetected without formal statistical testing. The dashed line ($\bullet$) marks the model for which we performed a hyperparameter search. Axes are scaled uniformly within each dataset (0.02 units per grid cell).
  • Figure 3: Prediction stability as a function of accuracy. For clarity, samples are binned by per-sample accuracy and APPA; point size reflects their relative frequency (normalized by dataset size). Color encodes stability: ($\bullet$) marks stable samples (APPA = 1.0), whereas ($\bullet$) marks unstable samples (APPA $<$ 1.0). Across datasets and architectures, samples with high accuracy across the models in the empirical Rashomon set consistently exhibit higher APPA (top-right corner). However, samples in the top-left corner are consistently misclassified, revealing systematic failure modes.
  • Figure 4: Changes from increasing model capacity (only affected samples). Samples are grouped as (I) correct-stable, (II) unstable, and (III) incorrect-stable. Bundles connect category transitions from EfficientNetB0 (left) to EfficientNetB4 (right), illustrating how samples shift between groups. Pink bundles mark the cost of model switching (previously correct-stable samples becoming unstable), while green bundles indicate newly detectable errors (previously consistently misclassified samples becoming unstable and thus identifiable). For Blood Cell, the effect is minimal (only 2.6% of samples are affected). For Abdominal CT, OCT Scan, and Breast Ultrasound, overall utility improves. For X-ray, utility stays the same, but different samples are affected.
  • Figure 5: Ensembles substantially reduce predictive multiplicity. Coverage rate ($\bullet$) and expected pairwise agreement ($\bullet$) of the test set across single models and ensembles of sizes 2, 5, and 10. For coverage, the darker segment indicates the correctly covered samples. For expected pairwise agreement, the darker segment represents stable samples, while the lighter segment corresponds to samples for which no judgment can be made due to missing coverage from the other ensemble. Error bars show standard deviation. Note the X-ray dataset: ensembles reduce predictive multiplicity but retain consistently incorrect predictions, reflecting shared systematic biases.
  • ...and 1 more figure
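
The stability grouping used in the captions above (correct-stable, unstable, incorrect-stable) can be sketched as follows. This excerpt does not define APPA, so the sketch assumes a simple stand-in: the fraction of model pairs that assign the same label to a sample, with a sample counted as stable only when that fraction is exactly 1.0. The function names and the toy data are hypothetical.

```python
from itertools import combinations

import numpy as np

def pairwise_agreement(predictions):
    """Per-sample fraction of model pairs that predict the same label.

    predictions: array of shape (n_models, n_samples), with n_models >= 2.
    This is an assumed stand-in for the agreement measure in the captions.
    """
    predictions = np.asarray(predictions)
    pairs = list(combinations(range(predictions.shape[0]), 2))
    agree = np.zeros(predictions.shape[1])
    for i, j in pairs:
        agree += predictions[i] == predictions[j]
    return agree / len(pairs)

def stability_groups(predictions, targets):
    """Assign each sample to correct-stable, unstable, or incorrect-stable."""
    predictions = np.asarray(predictions)
    targets = np.asarray(targets)
    agreement = pairwise_agreement(predictions)
    # Per-sample accuracy: share of models that predict the true label.
    accuracy = (predictions == targets).mean(axis=0)
    stable = agreement == 1.0
    groups = np.where(
        stable & (accuracy == 1.0),
        "correct-stable",
        np.where(stable, "incorrect-stable", "unstable"),
    )
    return agreement, accuracy, groups

# Toy example: 3 models, 4 samples (labels are made up).
preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 0, 1],
    [0, 2, 1, 1],
])
targets = np.array([0, 1, 1, 0])
agreement, accuracy, groups = stability_groups(preds, targets)
for a, acc, g in zip(agreement, accuracy, groups):
    print(f"agreement={a:.2f}  accuracy={acc:.2f}  group={g}")
```

In this toy data the last sample is unanimous but wrong, which is the kind of incorrect-stable case that agreement alone cannot flag.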