Data-Driven Machine Learning Approaches for Predicting In-Hospital Sepsis Mortality

Arseniy Shumilov, Yueting Zhu, Negin Ashrafi, Armin Abdollahi, Greg Placencia, Kamiar Alaei, Maryam Pishgar

TL;DR

The paper addresses the challenge of predicting in-hospital sepsis mortality with interpretable, high-accuracy models. It uses the MIMIC-III ICU dataset, applies a Random Forest–driven feature-selection pipeline to identify 35 key predictors, and employs SMOTE and rigorous validation across five classifiers, with Random Forest achieving AUROC of 0.97 and accuracy of 0.90. SHAP analysis provides explainability by highlighting influential features and their interactions, enhancing clinical trust. The work demonstrates that data-driven approaches can yield actionable, interpretable mortality predictions, with potential to support timely clinical decision-making and improve patient outcomes.
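
The AUROC quoted above can be read as the probability that a model scores a randomly chosen non-survivor higher than a randomly chosen survivor. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `auroc` and the toy labels and scores are made up for illustration.

```python
def auroc(labels, scores):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the paper): 1 = died in hospital, 0 = survived.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
toy_auc = auroc(toy_labels, toy_scores)
print(f"toy AUROC = {toy_auc:.3f}")
```

The double loop is quadratic in the number of samples, which is fine for illustration; library implementations typically use a rank-based formulation instead.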

Abstract

Sepsis is a severe condition responsible for many deaths in the United States and worldwide, making accurate prediction of outcomes crucial for timely and effective treatment. Previous studies employing machine learning faced limitations in feature selection and model interpretability, reducing their clinical applicability. This research aimed to develop an interpretable and accurate machine learning model to predict in-hospital sepsis mortality, addressing these gaps. Using ICU patient records from the MIMIC-III database, we extracted relevant data through a combination of literature review, clinical input refinement, and Random Forest-based feature selection, identifying the top 35 features. Data preprocessing included cleaning, imputation, standardization, and applying the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance, resulting in a dataset of 4,683 patients with 17,429 admissions. Five models (Random Forest, Gradient Boosting, Logistic Regression, Support Vector Machine, and K-Nearest Neighbor) were developed and evaluated. The Random Forest model demonstrated the best performance, achieving an accuracy of 0.90, AUROC of 0.97, precision of 0.93, recall of 0.91, and F1-score of 0.92. These findings underscore the potential of data-driven machine learning approaches to improve critical care, offering clinicians a powerful tool for predicting in-hospital sepsis mortality and enhancing patient outcomes.
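
To illustrate the oversampling step named in the abstract, here is a minimal pure-Python sketch of the core SMOTE idea: each synthetic minority row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. It is an assumption-laden illustration, not the authors' pipeline; the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and the abstract does not say which implementation or settings were used.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating toward nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)          # seeded for reproducibility
    k = min(k, len(minority) - 1)      # cannot have more neighbors than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority rows, sorted by Euclidean distance to base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        gap = rng.random()             # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical standardized features for four minority-class rows (NOT paper data).
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_sketch(minority_rows, n_new=6, k=2)
print(len(new_rows), "synthetic rows generated")
```

Because every synthetic row is a convex combination of two real minority rows, it stays inside the range of the observed minority data. Oversampling of this kind is usually applied to the training split only, so that synthetic rows do not leak into the evaluation data.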

Paper Structure (13 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Patient Selection Process. Graphical representation of patient inclusion criteria.
  • Figure 2: Data Extraction Process. Graphical representation of health data and lab indicator extraction.
  • Figure 3: Aggregation and Pivoting Process. Graphical representation of data aggregation and pivoting for patient features.
  • Figure 4: Feature Importance. Graphical representation of the top 35 most important features identified by a Random Forest.
  • Figure 5: Data Processing and Model Training Workflow. Graphical representation of the steps from data extraction to model prediction.
  • ...and 4 more figures