Bayesian Networks and Machine Learning for COVID-19 Severity Explanation and Demographic Symptom Classification
Oluwaseun T. Ajayi, Yu Cheng
TL;DR
The paper studies COVID-19 severity and demographic symptom patterns with a three-stage data-driven framework that combines Bayesian network (BN) structure learning, unsupervised clustering, and a demographic symptom identification (DSID) predictor. It analyzes a CDC US dataset of 537,243 cases with 24 features: a BN is learned to capture conditional dependencies and conditional probability distributions (CPDs), symptom patterns are clustered into symptom classes, and the DSID model maps each symptom class to a demographic probability distribution. The approach yields interpretable CPDs for ICU, ventilation, death, and medical conditions, and reports a testing accuracy of 99.99% for demographic class inference on held-out data, compared with 41.15% for a heuristic baseline. The work gives insight into symptom-demographic relationships and suggests a practical route to risk stratification and targeted public-health interventions.
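To make the notion of a CPD concrete, the following is a minimal sketch of how a table such as P(ICU | age group) could be estimated from categorical records. It assumes a single parent variable and plain relative-frequency (maximum-likelihood) estimation, neither of which is stated in the source; the function name `estimate_cpd`, the field names, and the records are hypothetical and are not taken from the CDC dataset.

```python
from collections import Counter, defaultdict


def estimate_cpd(records, child, parent):
    """Relative-frequency estimate of P(child | parent) from categorical records."""
    # counts[parent_value][child_value] = number of matching records
    counts = defaultdict(Counter)
    for row in records:
        counts[row[parent]][row[child]] += 1

    cpd = {}
    for parent_value, child_counts in counts.items():
        total = sum(child_counts.values())
        # Normalise the counts so each row of the table sums to 1
        cpd[parent_value] = {v: n / total for v, n in child_counts.items()}
    return cpd


# Made-up illustrative records (NOT CDC data); field names are hypothetical
records = [
    {"age_group": "65+", "icu": "yes"},
    {"age_group": "65+", "icu": "no"},
    {"age_group": "65+", "icu": "yes"},
    {"age_group": "65+", "icu": "no"},
    {"age_group": "18-49", "icu": "no"},
    {"age_group": "18-49", "icu": "no"},
    {"age_group": "18-49", "icu": "no"},
    {"age_group": "18-49", "icu": "yes"},
]

cpd = estimate_cpd(records, child="icu", parent="age_group")
for age_group, distribution in cpd.items():
    print(age_group, distribution)
```

Each row of the resulting table is a probability distribution over the child variable for one value of the parent, which is what makes such tables directly readable.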
Abstract
Despite the prevailing efforts to combat the coronavirus disease 2019 (COVID-19) pandemic, uncertainties remain about its spread, future impact, and resurgence. In this paper, we present a three-stage data-driven approach to distill the hidden information about COVID-19. The first stage employs a Bayesian network structure learning method to identify the causal relationships among COVID-19 symptoms and their intrinsic demographic variables. In the second stage, the output of the Bayesian network structure learning serves as a guide for training an unsupervised machine learning (ML) algorithm that uncovers the similarities in patients' symptoms through clustering. The final stage then leverages the labels obtained from clustering to train a demographic symptom identification (DSID) model, which predicts a patient's symptom class and the corresponding demographic probability distribution. We applied our method to the COVID-19 dataset obtained from the Centers for Disease Control and Prevention (CDC) in the United States. Results from the experiments show a testing accuracy of 99.99%, compared with the 41.15% accuracy of a heuristic ML method. This demonstrates the viability of our Bayesian network and ML approach for understanding the relationships among the virus symptoms and for providing insights into patient stratification aimed at reducing the severity of the virus.
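The second and third stages can be illustrated with a toy sketch: patients are grouped into symptom classes, and each class is then associated with an empirical demographic distribution. The abstract does not name the clustering algorithm or the DSID architecture, so this example substitutes a simple stand-in, nearest-prototype assignment by Hamming distance with fixed prototypes, and a frequency table in place of a trained model. All names (`assign`, `demographic_distributions`, `prototypes`) and all data are hypothetical.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of positions at which two binary symptom vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(symptoms, prototypes):
    """Index of the closest prototype (stand-in for a learned symptom class)."""
    return min(range(len(prototypes)), key=lambda k: hamming(symptoms, prototypes[k]))


def demographic_distributions(patients, prototypes):
    """Empirical P(age group | symptom class) over the given patients."""
    counts = defaultdict(Counter)
    for symptoms, age_group in patients:
        counts[assign(symptoms, prototypes)][age_group] += 1
    return {
        k: {g: n / sum(c.values()) for g, n in c.items()}
        for k, c in counts.items()
    }


# Made-up data (NOT CDC data): (fever, cough, shortness_of_breath), age group
patients = [
    ((1, 1, 0), "18-49"),
    ((1, 1, 0), "18-49"),
    ((1, 0, 0), "50-64"),
    ((1, 1, 0), "65+"),
    ((0, 1, 1), "65+"),
    ((0, 0, 1), "65+"),
    ((0, 1, 1), "65+"),
    ((0, 0, 1), "50-64"),
]

# Fixed prototypes chosen by hand; a real pipeline would learn the classes
prototypes = [(1, 1, 0), (0, 1, 1)]

dists = demographic_distributions(patients, prototypes)

# "Prediction" for a new patient: symptom class plus its demographic distribution
new_patient = (0, 1, 1)
predicted_class = assign(new_patient, prototypes)
print(predicted_class, dists[predicted_class])
```

The point of the sketch is only the shape of the output: a symptom class together with a probability distribution over demographic groups, as the abstract describes for the DSID model.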
