Interpretable Heart Disease Prediction via a Weighted Ensemble Model: A Large-Scale Study with SHAP and Surrogate Decision Trees
Md Abrar Hasnat, Md Jobayer, Md. Mehedi Hasan Shawon, Md. Golam Rabiul Alam
TL;DR
Cardiovascular disease risk prediction requires accurate and interpretable models on large-scale public-health data. The authors propose a strategically weighted ensemble that combines LightGBM, XGBoost, and a CNN, with explicit class-imbalance handling and clinically meaningful feature engineering (expanding from 22 to 25 features). The ensemble achieves a statistically significant improvement over the best single model (Test AUC $0.8371$, $p=0.003$) while maintaining high recall ($0.80$), making it suitable for screening contexts. Interpretability is addressed via SHAP (global and local explanations) and a surrogate decision tree that distills the model logic into actionable rules, with BMI_BP_Interaction identified as a root driver of risk. Collectively, the work demonstrates a scalable, transparent approach to deploying high-performance cardiovascular risk prediction tools in real-world screening settings.
Abstract
Cardiovascular disease (CVD) remains a critical global health concern, demanding reliable and interpretable predictive models for early risk assessment. This study presents a large-scale analysis of the Heart Disease Health Indicators Dataset and develops a strategically weighted ensemble that combines tree-based methods (LightGBM, XGBoost) with a Convolutional Neural Network (CNN) to predict CVD risk. The model was trained on a preprocessed dataset of 229,781 patients; the inherent class imbalance was managed through strategic weighting, and feature engineering expanded the original 22 features to 25. The final ensemble achieves a statistically significant improvement over the best individual model, with a Test AUC of 0.8371 (p=0.003), and its high recall of 80.0% makes it particularly suited for screening. To provide transparency and clinical interpretability, SHapley Additive exPlanations (SHAP) and surrogate decision trees are used. By blending diverse learning architectures and incorporating explainability, the proposed model combines robust predictive performance with clinical transparency, making it a strong candidate for real-world deployment in public health screening.
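
The weighted blending of the three base models can be illustrated with a minimal sketch. It assumes each model (LightGBM, XGBoost, CNN) has already produced a positive-class probability per patient. The function names, the weights (0.4, 0.4, 0.2), the toy probabilities, and the 0.4 decision threshold are hypothetical values chosen for illustration; they are not taken from the paper, which does not state its weighting scheme in this section.

```python
from typing import List, Sequence


def weighted_ensemble(probs: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Blend per-model positive-class probabilities using normalized weights."""
    if len(probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(probs[0])
    if any(len(p) != n for p in probs):
        raise ValueError("all models must score the same number of samples")
    # Normalize so the blended score stays a valid probability in [0, 1].
    norm = [w / total for w in weights]
    return [sum(w * p[i] for w, p in zip(norm, probs)) for i in range(n)]


def classify(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label a sample positive when its blended score reaches the threshold."""
    return [int(s >= threshold) for s in scores]


# Toy probabilities for three patients (hypothetical, not from the dataset).
lgbm_probs = [0.10, 0.60, 0.80]
xgb_probs = [0.20, 0.40, 0.90]
cnn_probs = [0.30, 0.50, 0.40]

# Hypothetical weights; in practice they would be tuned on validation data.
blended = weighted_ensemble([lgbm_probs, xgb_probs, cnn_probs],
                            weights=[0.4, 0.4, 0.2])

# A threshold below 0.5 favors recall, which suits a screening setting.
labels = classify(blended, threshold=0.4)
```

Lowering the threshold trades precision for recall, which is one plausible way a screening-oriented model could reach a high recall; whether the authors tune the threshold this way is an assumption here.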
