An Improved Ensemble-Based Machine Learning Model with Feature Optimization for Early Diabetes Prediction

Md. Najmul Islam, Md. Miner Hossain Rimon, Shah Sadek-E-Akbor Shamim, Zarif Mohaimen Fahad, Md. Jehadul Islam Mony, Md. Jalal Uddin Chowdhury

TL;DR

The paper addresses early diabetes prediction using a large BRFSS dataset, mitigating class imbalance with SMOTE-Tomek and enhancing performance via a stacking ensemble of XGBoost and KNN, with LightGBM as a meta-learner. It presents a thorough preprocessing, feature selection, and model-training pipeline, achieving 94.82% accuracy and ROC-AUC of 0.989 on unseen data, outperforming prior works on BRFSS-based tasks. A deployable web app with a chatbot supports real-time risk assessment and interpretable feature insights, bridging advanced analytics and clinical practice. While showing strong results, the study discusses limitations related to survey bias and generalizability, recommending future cross-population validation and alternative imbalance strategies.

Abstract

Diabetes is a serious worldwide health issue, and successful intervention depends on early detection. However, overlapping risk factors and class imbalance make prediction difficult. This study aims to use extensive health survey data to build a machine learning framework for diabetes classification that is both accurate and interpretable, producing results that can aid clinical decision-making. The study uses the 2015 BRFSS dataset, which includes roughly 253,680 records with 22 numerical features. On this dataset, we assessed a number of supervised learning techniques. SMOTE and Tomek Links were used to correct class imbalance. To improve prediction performance, both individual models and ensemble techniques such as stacking were investigated. The individual models Random Forest, XGBoost, CatBoost, and LightGBM each attained a strong ROC-AUC of approximately 0.96. The stacking ensemble with XGBoost and KNN yielded the best overall results, with 94.82% accuracy, ROC-AUC of 0.989, and PR-AUC of 0.991, indicating a favourable balance between recall and precision. We also developed a React Native-based application with a Python Flask backend to support early diabetes prediction, providing users with an accessible and efficient health monitoring tool.
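
As a rough illustration of the resampling step named in the abstract, the sketch below shows the idea behind SMOTE followed by Tomek-link removal in plain Python. It is not the authors' implementation: the function names (`smote`, `tomek_links`, `smote_tomek`), the neighbour count `k`, the toy data, and the choice to drop both members of each Tomek link are assumptions made here for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    if n_new > 0 and len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: sq_dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def smote_tomek(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class up to the majority count, then remove
    every point that takes part in a Tomek link (an assumption of this sketch)."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max((len(X) - len(minority)) - len(minority), 0)
    new_points = smote(minority, n_new, k=k, seed=seed)
    X_res = list(X) + new_points
    y_res = list(y) + [minority_label] * len(new_points)
    drop = {i for pair in tomek_links(X_res, y_res) for i in pair}
    keep = [i for i in range(len(X_res)) if i not in drop]
    return [X_res[i] for i in keep], [y_res[i] for i in keep]


# Toy data (hypothetical): six majority points near the origin, two minority points far away.
X_toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2),
         (2.0, 2.0), (2.2, 2.1)]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = smote_tomek(X_toy, y_toy)
```

The brute-force neighbour search here is quadratic in the number of records, so it only suits toy inputs. On a dataset of the size described above, a library implementation with an indexed neighbour search would be the practical choice. Resampling of this kind is normally applied to the training split only, so that evaluation data keeps its original class distribution.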

Paper Structure

This paper contains 15 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overall Workflow of Our Proposed Model
  • Figure 2: Performance of the Stacking Model: Confusion Matrix
  • Figure 3: Comparison of ROC Curves for All Models
  • Figure 4: Comparison of Precision–Recall Curves for All Models
  • Figure 5: User-provided clinical factors and model prediction results.
  • ...and 2 more figures