
Bayesian Networks and Machine Learning for COVID-19 Severity Explanation and Demographic Symptom Classification

Oluwaseun T. Ajayi, Yu Cheng

TL;DR

The paper presents a three-stage data-driven framework for explaining COVID-19 severity and demographic symptom patterns, combining Bayesian network (BN) structure learning, unsupervised clustering, and a demographic symptom identification (DSID) predictor. Using a CDC dataset of 537,243 US cases with 24 features, it learns a BN that captures conditional dependencies and conditional probability distributions (CPDs), clusters symptom patterns into classes, and trains a DSID model that maps each symptom class to a demographic probability distribution. The approach yields interpretable CPDs for ICU, ventilation, death, and medical conditions, and reports a testing accuracy of 99.99%, against 41.15% for a heuristic ML baseline. The work offers insight into symptom-demographic relationships and a practical pathway for risk stratification and targeted public-health interventions.

Abstract

Despite the prevailing efforts to combat the coronavirus disease 2019 (COVID-19) pandemic, uncertainties remain about its spread, future impact, and resurgence. In this paper, we present a three-stage data-driven approach to distill the hidden information about COVID-19. The first stage employs a Bayesian network structure learning method to identify the causal relationships among COVID-19 symptoms and their intrinsic demographic variables. In the second stage, the output of the Bayesian network structure learning serves as a guide to train an unsupervised machine learning (ML) algorithm that uncovers the similarities in patients' symptoms through clustering. The final stage then leverages the labels obtained from clustering to train a demographic symptom identification (DSID) model, which predicts a patient's symptom class and the corresponding demographic probability distribution. We applied our method to the COVID-19 dataset obtained from the Centers for Disease Control and Prevention (CDC) in the United States. Results from the experiments show a testing accuracy of 99.99%, compared with the 41.15% accuracy of a heuristic ML method. This strongly supports the viability of our Bayesian network and ML approach in understanding the relationships among the virus symptoms and in providing insights on patient stratification towards reducing the severity of the virus.
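The second and third stages described in the abstract (clustering symptom vectors, then using the cluster labels to predict a symptom class and its demographic distribution) can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: the records, the symptom order, the Hamming-distance clustering, and the nearest-center `predict` function (a stand-in for the DSID model) are all hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (binary symptom vector, age group).
# Assumed symptom order: (fever, cough, shortness_of_breath).
RECORDS = [
    ((1, 1, 0), "18-49"),
    ((1, 1, 0), "18-49"),
    ((1, 0, 0), "18-49"),
    ((0, 1, 1), "65+"),
    ((0, 1, 1), "65+"),
    ((0, 0, 1), "50-64"),
]


def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(vec, centers):
    """Index of the closest center (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda k: hamming(vec, centers[k]))


def cluster(vectors, centers, iterations=10):
    """Assign each vector to its nearest center, then reset each center
    to the per-feature majority value of its members."""
    centers = list(centers)
    for _ in range(iterations):
        labels = [nearest(v, centers) for v in vectors]
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = tuple(
                    int(2 * sum(col) >= len(members)) for col in zip(*members)
                )
    return centers, [nearest(v, centers) for v in vectors]


def demographic_distribution(labels, groups, k):
    """Relative frequency of each demographic group inside cluster k."""
    counts = Counter(g for lab, g in zip(labels, groups) if lab == k)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}


vectors = [vec for vec, _ in RECORDS]
groups = [grp for _, grp in RECORDS]

# Stage 2 (stand-in): cluster symptom vectors, seeded with two records.
centers, labels = cluster(vectors, [vectors[0], vectors[3]])

# Stage 3 (stand-in): one demographic distribution per symptom class.
distributions = [
    demographic_distribution(labels, groups, k) for k in range(len(centers))
]


def predict(symptoms):
    """Return (symptom class, demographic distribution) for a new patient."""
    k = nearest(symptoms, centers)
    return k, distributions[k]


print(predict((1, 0, 0)))
```

The nearest-center lookup is chosen only to keep the sketch dependency-free; any classifier trained on the cluster labels could take its place.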

Paper Structure (18 sections, 7 equations, 10 figures, 6 tables)


Figures (10)

  • Figure 1: Proposed three-stage framework of BN and ML for COVID-19 severity explanation and demographic classification. In stage 1, the BNs of the severity variables and demographic variables are obtained from the main BN (indicated by the blue broken lines connecting them).
  • Figure 2: Data collection map for cases contained in the COVID-19 dataset provided by CDC. States with cases in the dataset are the orange ones, while those that are grey have no cases in the dataset.
  • Figure 3: DAG showing the relationship between predictor variables $\mathcal{F}_c$ and target variables $\mathcal{F}_t$, used to guide feature selection for training the DSID model.
  • Figure 4: DAG showing conditional dependencies between severity variables (G, N, O, P), shown in purple, and their parents, shown in orange.
  • Figure 5: Conditional probability distribution showing how the causal variables (see legend map) cause death across different age groups (x-axis). Subplot (a) shows the probability when death = No, and subplot (b) shows the probability when death = Yes.
  • ...and 5 more figures
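
The conditional probability distributions mentioned in the Figure 5 caption can be illustrated with a small frequency-count estimate. The sketch below is only an assumption-laden example: the toy cases, the choice of `age_group` and `icu` as parents of `death`, and the helper `estimate_cpd` are hypothetical and are not taken from the paper or the CDC data. It uses plain maximum-likelihood estimates with no smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical toy cases: (age_group, icu, death). Not values from the CDC data.
CASES = [
    ("18-49", "No", "No"),
    ("18-49", "No", "No"),
    ("18-49", "Yes", "No"),
    ("65+", "Yes", "Yes"),
    ("65+", "Yes", "No"),
    ("65+", "No", "No"),
    ("65+", "Yes", "Yes"),
]


def estimate_cpd(rows, child_index, parent_indices):
    """Estimate P(child | parents) as relative frequencies per parent setting."""
    counts = defaultdict(Counter)
    for row in rows:
        parents = tuple(row[i] for i in parent_indices)
        counts[parents][row[child_index]] += 1
    return {
        parents: {
            value: n / sum(child_counts.values())
            for value, n in child_counts.items()
        }
        for parents, child_counts in counts.items()
    }


# Assumed parent set for illustration: death depends on (age_group, icu).
death_cpd = estimate_cpd(CASES, child_index=2, parent_indices=(0, 1))

for parents, table in sorted(death_cpd.items()):
    print(parents, table)
```

Reading one row of `death_cpd` for a fixed ICU value across age groups corresponds to the kind of per-age-group probability that such a plot would display.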