Optimizing Stroke Risk Prediction: A Machine Learning Pipeline Combining ROS-Balanced Ensembles and XAI

A S M Ahsanul Sarkar Akib, Raduana Khawla, Abdul Hasib

TL;DR

The study tackles stroke risk prediction by combining tree-based ensembles (Random Forest, ExtraTrees, XGBoost), trained on data balanced with Random Over-Sampling (ROS), with explainable AI (LIME). Models are evaluated with 5-fold cross-validation on two datasets, SPD (the Stroke Prediction Dataset) and SDP. The final soft voting ensemble achieves 99.09% accuracy on SPD and strong performance on SDP, and LIME provides local explanations for key predictors such as age, hypertension, and glucose level. The approach addresses class imbalance and interpretability concerns, showing potential for data-driven, personalized clinical decision support. Limitations include data quality and external validity; future work targets multicenter data integration, deeper feature extraction, and cloud-based deployment to improve generalizability and practicality.
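
Soft voting averages the class probabilities predicted by each base model and picks the class with the highest mean, rather than counting one vote per model. The sketch below illustrates only that averaging step in plain Python. The function name `soft_vote`, the equal default weights, and the probability values are assumptions made for illustration; they are not taken from the paper, which does not describe its implementation in this summary.

```python
def soft_vote(model_probs, weights=None):
    """Average per-class probabilities across models, then take the argmax.

    model_probs: one entry per model; each entry is a list of rows, and each
    row holds that model's class probabilities for one sample.
    weights: optional per-model weights (hypothetical; equal by default).
    """
    if weights is None:
        weights = [1.0] * len(model_probs)
    total = sum(weights)
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])

    averaged = []
    for i in range(n_samples):
        row = [
            sum(w * probs[i][c] for w, probs in zip(weights, model_probs)) / total
            for c in range(n_classes)
        ]
        averaged.append(row)

    # Predicted label = index of the largest averaged probability.
    labels = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, labels


# Made-up probabilities [P(no stroke), P(stroke)] for two patients,
# standing in for Random Forest, ExtraTrees, and XGBoost outputs.
rf_probs = [[0.9, 0.1], [0.4, 0.6]]
et_probs = [[0.8, 0.2], [0.6, 0.4]]
xgb_probs = [[0.7, 0.3], [0.3, 0.7]]

averaged, labels = soft_vote([rf_probs, et_probs, xgb_probs])
print(labels)
```

In practice this step is usually delegated to a library, for example scikit-learn's `VotingClassifier` with `voting="soft"`.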

Abstract

Stroke is a leading cause of death and permanent impairment, making it a major worldwide health concern. Early risk assessment is essential for prompt intervention and effective prevention. To address this challenge, we used ensemble modeling and explainable AI (XAI) techniques to create an interpretable machine learning framework for stroke risk prediction. Our approach comprised data preprocessing (using Random Over-Sampling (ROS) to address class imbalance), feature engineering, and a thorough evaluation of 10 machine learning models using 5-fold cross-validation across several datasets. Our optimized ensemble model (Random Forest + ExtraTrees + XGBoost) obtained 99.09% accuracy on the Stroke Prediction Dataset (SPD). Using LIME-based interpretability analysis, we identified three important clinical variables (age, hypertension, and glucose levels), which improves the model's transparency and clinical applicability. This study highlights how combining ensemble learning with XAI can deliver highly accurate and interpretable stroke risk assessment for early prediction. By enabling data-driven prevention and personalized clinical decisions, our framework has the potential to transform stroke prediction and cardiovascular risk management.
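
Random Over-Sampling balances classes by duplicating randomly chosen minority-class rows until every class matches the size of the largest one. The sketch below shows that idea using only the Python standard library. The function name `random_oversample`, the fixed seed, and the toy 8-to-2 class split are assumptions for illustration and do not come from the paper; a library implementation such as imbalanced-learn's `RandomOverSampler` would normally be used instead.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sample with replacement from this class until it reaches the target size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy imbalanced data: 8 "no stroke" rows (label 0) and 2 "stroke" rows (label 1).
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

bal_rows, bal_labels = random_oversample(rows, labels)
print(Counter(bal_labels))
```

When over-sampling is combined with cross-validation, a common precaution is to resample only the training portion of each fold, so that duplicated rows cannot also appear in the held-out portion. The text here does not state how the authors ordered these two steps.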

Paper Structure

This paper contains 19 sections, 2 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: The Proposed Methodology Diagram for Stroke Risk Prediction.
  • Figure 2: The Stroke Distribution of the Two Datasets.
  • Figure 3: LIME Explanation for Ensemble Model Using Dataset SPD.
  • Figure 4: LIME Explanation for Ensemble Model Using Dataset SDP.