Machine Learning Models for Predicting Smoking-Related Health Decline and Disease Risk

Vaskar Chakma, MD Jaheid Hasan Nerab, Abdur Rouf, Abu Sayed, Hossem MD Saim, Md. Nournabi Khan

TL;DR

This study conducts a rigorous, cross-sectional evaluation of five machine learning models (notably Random Forest) for predicting smoking-related health decline using a large multi-system health screening dataset. By prioritizing interpretability with SHAP, benchmarking against the Framingham risk score, and addressing class imbalance, it demonstrates that Random Forest yields superior discrimination (AUC-ROC ≈ 0.926, AUC-PR ≈ 0.880) and balanced sensitivity/specificity, while revealing hepatic and metabolic biomarkers as key risk drivers. The work emphasizes clinically actionable insights, such as sex-specific effects and multi-system injury patterns, and discusses deployment considerations, fairness, and the need for prospective validation to translate these tools into practice. Collectively, the findings advocate for integrated, interpretable risk assessment frameworks to enable earlier, personalized interventions for smokers, moving beyond single-endpoint screening toward multi-domain precision prevention.

Abstract

Smoking continues to be a major preventable cause of death worldwide, affecting millions through damage to the heart, metabolism, liver, and kidneys. However, current medical screening methods often miss the early warning signs of smoking-related health problems, leading to late-stage diagnoses when treatment options become limited. This study presents a systematic comparative evaluation of machine learning approaches for smoking-related health risk assessment, emphasizing clinical interpretability and practical deployment over algorithmic innovation. We analyzed health screening data from 55,691 individuals, examining various health indicators, including body measurements, blood tests, and demographic information. We tested three advanced prediction algorithms - Random Forest, XGBoost, and LightGBM - to determine which could most accurately identify people at high risk. This study employed a cross-sectional design to classify current smoking status based on health screening biomarkers, not to predict future disease development. Our Random Forest model performed best, achieving an Area Under the Curve (AUC) of 0.926, meaning it could reliably distinguish between high-risk and lower-risk individuals. Using SHAP (SHapley Additive exPlanations) analysis to understand what the model was detecting, we found that key health markers played crucial roles in prediction: blood pressure levels, triglyceride concentrations, liver enzyme readings, and kidney function indicators (serum creatinine) were the strongest signals of declining health in smokers.
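To make the reported AUC concrete, the sketch below computes AUC-ROC from its pairwise definition: the probability that a randomly chosen positive case (here, a current smoker, following the study's cross-sectional design) receives a higher score than a randomly chosen negative case. The labels and scores are hypothetical toy values, not data from the study, and the function name `auc_roc` is an illustrative assumption rather than the authors' code.

```python
from itertools import product


def auc_roc(labels, scores):
    """Return AUC-ROC as the fraction of (positive, negative) pairs
    in which the positive example gets the higher score.
    Tied pairs count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = current smoker, 0 = non-smoker.
labels = [1, 1, 1, 0, 0, 0, 0]
# Hypothetical model scores (predicted probability of the positive class).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

# 11 of the 12 positive/negative pairs are ranked correctly.
print(auc_roc(labels, scores))
```

In this toy case 11 of the 12 positive/negative pairs are ordered correctly, giving an AUC of about 0.917. Read the same way, an AUC of 0.926 means the model ranks a randomly chosen positive case above a randomly chosen negative case roughly 93% of the time.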

Paper Structure

This paper contains 34 sections, 1 equation, 8 figures, 9 tables, 3 algorithms.

Figures (8)

  • Figure 1: Disentangling the Interdependent Relationships Among Health Indicators.
  • Figure 2: Visualization of the effects of age and BMI (left) and systolic blood pressure (SBP) and BMI (right) on health risk predictions, highlighting the non-linear relationships between these factors and their impact on smoking-related health decline.
  • Figure 3: ROC curves comparing the predictive performance of the machine learning models for smoking-related health decline; higher AUC values indicate better discrimination between risk groups.
  • Figure 4: Heatmap of the correlations between health metrics such as blood pressure, cholesterol, and organ function markers. Color intensity indicates the strength of each relationship; red hues indicate positive correlations and blue hues indicate negative correlations.
  • Figure 5: SHAP summary plot showing the top 15 health indicators ranked by importance (vertical axis) and the direction of their impact (horizontal axis). Each point represents an individual from the test set. Red indicates high feature values and blue indicates low values. Points with SHAP values above zero increase the predicted smoking risk, while those below zero decrease it. The width of each violin shows the density of observations. Gender emerges as the dominant predictor, followed by hepatic markers (GGT, ALT, AST) and metabolic indicators (hemoglobin, triglycerides).
  • ...and 3 more figures