Oversampling techniques for predicting COVID-19 patient length of stay

Zachariah Farahany, Jiawei Wu, K M Sajjadul Islam, Praveen Madiraju

TL;DR

This work predicts COVID-19 patient severity, measured by hospital length of stay (LOS), from de-identified EHR data. It addresses severe class imbalance with oversampling and trains an Artificial Neural Network (ANN) whose hyperparameters are tuned by Bayesian optimization. Five training schemes (raw, weighted, oversampled, undersampled, and SMOTE-NC) are compared; the raw-data pipeline delivers the best overall F1 score (up to 85.79%), while undersampling suffers from data scarcity. Mutual information analysis identifies clinical signals such as magnesium levels and IV infusions as informative for LOS prediction, offering interpretable insights for hospital workflows. The study suggests that hyperparameter-tuned deep learning can improve patient flow and resource allocation, though its generalizability is limited by dataset homogeneity and its single-center scope.
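For reference, the F1 score quoted above is the harmonic mean of precision and recall. The definition below is the standard one for a single class and is not specific to this paper; $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives. How the score is averaged across LOS classes is not stated in this summary.

$$
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

True negatives do not appear in the formula, so a model cannot score well on a rare class simply by always predicting the majority class. This is why F1 is a common choice for imbalanced problems.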

Abstract

COVID-19 is a respiratory disease that emerged in 2019 and caused a global pandemic. It is highly infectious, and its symptoms include fever or chills, cough, shortness of breath, fatigue, muscle or body aches, headache, new loss of taste or smell, sore throat, congestion or runny nose, nausea or vomiting, and diarrhea. These symptoms vary in severity; some people with many risk factors have had lengthy hospital stays or have died from the disease. In this paper, we analyze patients' electronic health records (EHR) to predict the severity of their COVID-19 infection, using the length of stay (LOS) as our measure of severity. This is an imbalanced classification problem, because most patients have a short LOS and few have a long one. To address the imbalance, we synthetically create alternative oversampled training data sets. We then train an Artificial Neural Network (ANN) on this oversampled data, tuning its hyperparameters with Bayesian optimization during training. Finally, we select the model with the best F1 score, evaluate it, and discuss the results.
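To make the oversampling idea concrete, here is a minimal sketch of plain random oversampling: minority-class rows are duplicated at random until every class is as large as the biggest one. It is an illustration only, not the authors' pipeline. The function name `random_oversample`, the toy feature values, and the "short"/"long" labels are all hypothetical, and the sketch does not cover SMOTE-NC, which synthesizes new rows rather than copying existing ones.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap (empty for the majority class).
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy training split: 8 short stays and 2 long stays.
rows = [[float(i)] for i in range(10)]
labels = ["short"] * 8 + ["long"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
counts = Counter(balanced_labels)
print(counts)
```

Oversampling of this kind is normally applied to the training split only, so that duplicated rows cannot leak into the data used for evaluation.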

Paper Structure

This paper contains 15 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Data pipeline
  • Figure 2: Training pipeline
  • Figure 3: Distribution of the LOS
  • Figure 4: Two component PCA with classes included
  • Figure 5: Confusion Matrix Model #1
  • ...and 2 more figures