Stronger Baseline Models -- A Key Requirement for Aligning Machine Learning Research with Clinical Utility

Nathan Wolfrath; Joel Wolfrath; Hengrui Hu; Anjishnu Banerjee; Anai N. Kothari

TL;DR

The paper addresses how the lack of strong, well-tuned baselines can obscure the true value of complex ML methods in healthcare. By analyzing five case studies, it shows that robust baselines often match or exceed the performance of sophisticated models, revealing when added complexity is unnecessary and highlighting issues of generalization and interpretability. It then offers a practical evaluation framework and best practices for constructing, reporting, and reasoning about baselines to better align ML research with clinical utility. This approach aims to reduce deployment barriers by improving transparency, comparability, and relevance of ML models in real-world healthcare settings.

Abstract

Machine learning (ML) research has increased substantially in recent years due to the success of predictive modeling across diverse application domains. However, well-known barriers exist when attempting to deploy ML models in high-stakes clinical settings, including a lack of model transparency (the inability to audit the inference process), large training data requirements combined with siloed data sources, and complicated metrics for measuring model utility. In this work, we show empirically that including stronger baseline models in healthcare ML evaluations has important downstream effects that help practitioners address these challenges. Through a series of case studies, we find that the common practice of omitting baselines, or of comparing against a weak baseline model (e.g., a linear model with no optimization), obscures the value of ML methods proposed in the research literature. Using these insights, we propose best practices that will enable practitioners to more effectively study and deploy ML models in clinical settings.

Paper Structure (9 sections, 5 figures, 1 table)

This paper contains 9 sections, 5 figures, 1 table.

Introduction
Statistical Baselines
Case Studies
PCR Testing
Heart Disease Prediction
Gastrectomy Mortality
SARS-CoV-2 Mortality
Sepsis Forecasting
Discussion

Figures (5)

Figure 1: Model comparison for SARS-CoV-2 PCR testing
Figure 2: Model performance on Cleveland heart disease data. The transformer-based approach demonstrates superior performance on most metrics.
Figure 3: Model performance on data external to Cleveland dataset. Performance is similar across model types, with LR attaining the highest AU-ROC, sensitivity, and accuracy, and the transformer model the highest specificity.
Figure 4: Model comparison for postoperative 90-day mortality
Figure 5: Model comparison for risk of SARS-CoV-2 case fatality. Weighted LR and Weighted GAM attain slightly lower accuracy, with a higher sensitivity and AU-ROC compared to the proposed autoencoder.
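
The figure captions above compare models on AU-ROC, sensitivity, specificity, and accuracy. The following is a minimal sketch of how those four metrics can be computed from any model's predicted scores, so that a baseline and a proposed model are scored identically. It assumes binary labels coded 0/1 with both classes present and a hypothetical decision threshold of 0.5; the function names are illustrative and are not taken from the paper.

```python
from typing import Dict, Sequence


def au_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AU-ROC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AU-ROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def threshold_metrics(
    y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy at a fixed threshold.

    The 0.5 default is an assumption for illustration only.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / len(y_true),
    }


def evaluate(y_true: Sequence[int], scores: Sequence[float]) -> Dict[str, float]:
    """Collect all four metrics for one model's scores."""
    result = threshold_metrics(y_true, scores)
    result["au_roc"] = au_roc(y_true, scores)
    return result
```

Calling `evaluate` once with the baseline's scores and once with the proposed model's scores on the same held-out labels gives a like-for-like comparison across all four metrics.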
