Feature-Enhanced Machine Learning for All-Cause Mortality Prediction in Healthcare Data
HyeYoung Lee, Pavel Tsoi
TL;DR
This work addresses the prediction of all-cause in-hospital mortality in ICU patients using the MIMIC-III dataset. It introduces a feature-engineered, Random Forest-based pipeline that combines LASSO feature selection, SMOTE oversampling, and Grid Search hyperparameter tuning, complemented by SHAP-based interpretability. The approach achieves high discriminative performance (AUC $= 0.94$) and strong clinical explainability, underscoring the value of careful feature engineering over more complex deep learning models in noisy, high-dimensional EHR data. The findings support the deployment of robust clinical decision support tools and point to future work on external validation and disease-specific customization to improve generalizability and impact.
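To make the SMOTE step concrete, the following is a minimal sketch of the interpolation idea behind it, not the authors' implementation. It assumes purely numeric features; the function name `smote_like`, its parameters, and the toy rows are hypothetical and chosen only for illustration. Each synthetic minority-class row is placed at a random point on the line segment between a real minority row and one of its nearest minority neighbours.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def smote_like(minority: Sequence[Vector], n_new: int, k: int = 2, seed: int = 0) -> List[Vector]:
    """Create n_new synthetic minority rows by interpolating toward near neighbours.

    Simplified illustration of the SMOTE idea (hypothetical helper, not a library API).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[Vector] = []
    for _ in range(n_new):
        # Pick a real minority row as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate: the new row lies between the base row and the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class rows (made-up values, e.g. two standardized features).
toy_minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5]]
new_rows = smote_like(toy_minority, n_new=4)
print(new_rows)
```

In practice, oversampling of this kind is usually applied to the training split only, so that synthetic rows do not leak into the data used for evaluation.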
Abstract
Accurate patient mortality prediction enables effective risk stratification, which in turn supports personalized treatment plans and improved patient outcomes. However, predicting mortality in healthcare remains a significant challenge, and existing studies often focus on specific diseases or limited predictor sets. This study evaluates machine learning models for all-cause in-hospital mortality prediction using the MIMIC-III database, employing a comprehensive feature engineering approach. Guided by clinical expertise and the literature, we extracted key features such as vital signs (e.g., heart rate, blood pressure), laboratory results (e.g., creatinine, glucose), and demographic information. The Random Forest model achieved the highest performance, with an area under the ROC curve (AUC) of 0.94, significantly outperforming the other machine learning and deep learning approaches evaluated. This demonstrates Random Forest's robustness in handling high-dimensional, noisy clinical data and its potential for developing effective clinical decision support tools. Our findings highlight the importance of careful feature engineering for accurate mortality prediction. We conclude by discussing implications for clinical adoption and proposing future directions, including enhancing model robustness and tailoring prediction models to specific diseases.
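As an aid to reading the reported AUC, the sketch below computes the metric from its pairwise definition: the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; they are not drawn from MIMIC-III or from the study's results.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case correctly ranked above negative case
            elif p == n:
                wins += 0.5   # tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy data (made up): 1 = died in hospital, 0 = survived; scores are predicted risks.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")  # 11 of 12 pairs ranked correctly -> AUC = 0.917
```

Under this reading, an AUC of 0.94 means that about 94% of such patient pairs are ordered correctly by the model's risk score.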
