A Comparative Study of Diabetes Prediction Based on Lifestyle Factors Using Machine Learning
Bruce Nguyen, Yan Zhang
TL;DR
The paper tackles diabetes risk prediction from lifestyle data in the BRFSS 2015 survey and compares three classifiers (Decision Tree, KNN, and Logistic Regression) on a balanced binary task derived from a 3-class target (no diabetes vs. prediabetes/diabetes). Preprocessing applies min-max normalization to numeric attributes and splits the data 80/20 into training and test sets; each model's hyperparameters are tuned via grid search with cross-validation. Logistic Regression achieves the highest accuracy at $0.751$, followed by Decision Tree at $0.741$ and KNN at $0.721$, and Logistic Regression also shows the most balanced precision-recall performance. The study demonstrates the feasibility of ML-based diabetes risk stratification from lifestyle factors and points to expanded feature sets and ensemble methods as ways to improve predictive robustness and interpretability.
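The preprocessing and tuning steps summarized above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' code: the synthetic data stands in for the balanced BRFSS 2015 set, and the hyperparameter grids, the 5-fold cross-validation, the stratified split, and the choice to fit the scaler inside the pipeline are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced data (assumption: 21 features, binary
# target). Replace X and y with the real BRFSS 2015 features and labels.
X, y = make_classification(
    n_samples=1000, n_features=21, n_informative=8,
    weights=[0.5, 0.5], random_state=0,
)

# 80/20 train/test split; stratification is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical search grids; the grids used in the study are not given here.
candidates = {
    "Decision Tree": (DecisionTreeClassifier(random_state=0),
                      {"clf__max_depth": [3, 5, 10]}),
    "KNN": (KNeighborsClassifier(),
            {"clf__n_neighbors": [5, 15, 25]}),
    "Logistic Regression": (LogisticRegression(max_iter=1000),
                            {"clf__C": [0.1, 1.0, 10.0]}),
}

test_accuracy = {}
for name, (estimator, grid) in candidates.items():
    # Min-max scaling lives inside the pipeline, so it is refit on the
    # training portion of each cross-validation fold.
    pipe = Pipeline([("scale", MinMaxScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    # Accuracy of the best refit model on the held-out 20%.
    test_accuracy[name] = search.score(X_test, y_test)
    print(f"{name}: best params {search.best_params_}, "
          f"test accuracy {test_accuracy[name]:.3f}")
```

Because the data here is synthetic, the printed accuracies say nothing about the reported results; the sketch only shows how the three models can share one scaling, splitting, and tuning procedure.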
Abstract
Diabetes is a prevalent chronic disease with significant health and economic burdens worldwide. Early prediction and diagnosis can aid in effective management and prevention of complications. This study explores the use of machine learning models to predict diabetes based on lifestyle factors using data from the Behavioral Risk Factor Surveillance System (BRFSS) 2015 survey. The dataset consists of 21 lifestyle and health-related features, capturing aspects such as physical activity, diet, mental health, and socioeconomic status. Three classification models, Decision Tree, K-Nearest Neighbors (KNN), and Logistic Regression, are implemented and evaluated to determine their predictive performance. The models are trained and tested on a balanced dataset, and their performance is assessed using accuracy, precision, recall, and F1-score. The results indicate that the Decision Tree, KNN, and Logistic Regression achieve accuracies of 0.74, 0.72, and 0.75, respectively, with varying strengths in precision and recall. The findings highlight the potential of machine learning in diabetes prediction and suggest future improvements through feature selection and ensemble learning techniques.
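For readers less familiar with the four evaluation metrics named above, the sketch below computes them from binary confusion-matrix counts using their standard definitions. The function name `classification_metrics` and the counts are hypothetical and chosen only for illustration; they are not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.3f}")
```

On a balanced dataset accuracy is a reasonable headline figure, while precision and recall show whether a model's errors lean toward false alarms or missed cases.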
