Data-Driven Machine Learning Approaches for Predicting In-Hospital Sepsis Mortality

Arseniy Shumilov; Yueting Zhu; Negin Ashrafi; Armin Abdollahi; Greg Placencia; Kamiar Alaei; Maryam Pishgar

TL;DR

The paper addresses the challenge of predicting in-hospital sepsis mortality with interpretable, high-accuracy models. It uses the MIMIC-III ICU dataset, applies a Random Forest–driven feature-selection pipeline to identify 35 key predictors, and employs SMOTE and rigorous validation across five classifiers, with Random Forest achieving AUROC of 0.97 and accuracy of 0.90. SHAP analysis provides explainability by highlighting influential features and their interactions, enhancing clinical trust. The work demonstrates that data-driven approaches can yield actionable, interpretable mortality predictions, with potential to support timely clinical decision-making and improve patient outcomes.
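
To make the class-balancing step concrete, here is a minimal sketch of the idea behind SMOTE: each synthetic minority-class sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority neighbors. The function name `smote`, its parameters, and the toy feature vectors are illustrative assumptions. The summary does not say which implementation the authors used, and a maintained library such as imbalanced-learn would normally be preferred over hand-written code.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority-class neighbors.

    Hypothetical helper for illustration; not the authors' code.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbors of the base point, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation weight in [0, 1)
        # New point lies on the segment from base to the chosen neighbor.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class: three 2-D feature vectors (made-up values).
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]]
new_points = smote(minority, n_new=4, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it always stays inside the range spanned by the observed minority data. Oversampling of this kind is normally applied to the training split only, so that evaluation data contains no synthetic records.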

Abstract

Sepsis is a severe condition responsible for many deaths in the United States and worldwide, making accurate prediction of outcomes crucial for timely and effective treatment. Previous machine learning studies were limited in feature selection and model interpretability, which reduced their clinical applicability. This research aimed to address these gaps by developing an interpretable and accurate machine learning model to predict in-hospital sepsis mortality. Using ICU patient records from the MIMIC-III database, we extracted relevant data through a combination of literature review, clinical input refinement, and Random Forest-based feature selection, identifying the top 35 features. Data preprocessing included cleaning, imputation, standardization, and application of the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance, resulting in a dataset of 4,683 patients with 17,429 admissions. Five models (Random Forest, Gradient Boosting, Logistic Regression, Support Vector Machine, and K-Nearest Neighbor) were developed and evaluated. The Random Forest model performed best, achieving an accuracy of 0.90, AUROC of 0.97, precision of 0.93, recall of 0.91, and F1-score of 0.92. These findings underscore the potential of data-driven machine learning approaches to improve critical care, offering clinicians a powerful tool for predicting in-hospital sepsis mortality and enhancing patient outcomes.
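
The reported metrics are related by standard definitions, which the sketch below spells out. It computes accuracy, precision, recall, and F1 from confusion-matrix counts, then checks that the reported precision (0.93) and recall (0.91) are consistent with the reported F1-score (0.92). The function names and the confusion-matrix counts are made up for illustration and are not taken from the paper. The abstract also does not say which class is treated as positive, so treating in-hospital death as the positive class is an assumption here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to exercise the function.
toy = classification_metrics(tp=91, fp=7, fn=9, tn=93)

# Consistency check on the figures quoted in the abstract.
reported_f1 = f1_from(0.93, 0.91)
print(round(reported_f1, 2))  # 0.92
```

AUROC cannot be recovered from a single confusion matrix, because it summarizes ranking quality across all decision thresholds. This is why it is reported separately from the threshold-dependent metrics above.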

Paper Structure (13 sections, 9 figures, 4 tables)

This paper contains 13 sections, 9 figures, 4 tables.

Introduction
Methodology
Data preprocessing
Feature selection and feature importance
Model development and optimization
Statistical analysis between cohorts
Results
Evaluation metrics proposed and baseline models’ performance
Shapley Value analysis
Discussion
Summary of existing model compilation
Study limitations and future research
Conclusion

Figures (9)

Figure 1: Patient Selection Process. Graphical representation of patient inclusion criteria.
Figure 2: Data Extraction Process. Graphical representation of health data and lab indicator extraction.
Figure 3: Aggregation and Pivoting Process. Graphical representation of data aggregation and pivoting for patient features.
Figure 4: Feature Importance. Graphical representation of the top 35 most important features identified by a Random Forest.
Figure 5: Data Processing and Model Training Workflow. Graphical representation of the steps from data extraction to model prediction.
...and 4 more figures
