Bayesian Nonparametric Dimensionality Reduction of Categorical Data for Predicting Severity of COVID-19 in Pregnant Women

Marzieh Ajirak, Cassandra Heiselman, Anna Fuchs, Mia Heiligenstein, Kimberly Herrera, Diana Garretto, Petar Djuric

TL;DR

The paper addresses the challenge of predicting COVID-19 severity in pregnant women from high-dimensional multivariate categorical data. It introduces a discrete-GPLVM approach that maps categorical features into a $Q$-dimensional latent space via Gaussian process priors and a softmax likelihood, with variational inducing-point inference for scalability. Key contributions include the formulation of a multivariate discrete GPLVM, Monte Carlo-based ELBO optimization, and demonstrated improvements in predictive performance over one-hot baselines on synthetic and real COVID-19 pregnancy data, along with latent-space visualizations that separate severity groups. The work offers a data-efficient method for extracting latent structure from sparse categorical clinical data, enabling better risk stratification and decision support in pregnancy during the COVID-19 era.
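To make the generative side of this description concrete, the following is a minimal sketch of sampling categorical data from a latent-GP model with a softmax likelihood. It is not the authors' implementation: the RBF kernel, the standard normal prior on the latent points, the jitter value, and all names (`rbf_kernel`, `sample_discrete_gplvm`, `N`, `D`, `K`) are assumptions made for illustration, and the variational inducing-point inference is omitted entirely.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix (an assumed kernel choice)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return variance * np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)

def softmax(F):
    """Row-wise softmax with the usual max-subtraction for stability."""
    F = F - F.max(axis=-1, keepdims=True)
    E = np.exp(F)
    return E / E.sum(axis=-1, keepdims=True)

def sample_discrete_gplvm(N=50, Q=2, D=4, K=3, seed=0):
    """Draw N patients with D categorical variables of K categories each.

    Hypothetical sizes; the real data need not share one K per variable.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, Q))          # latent points in Q dimensions
    Kxx = rbf_kernel(X) + 1e-6 * np.eye(N)   # jitter keeps Cholesky stable
    L = np.linalg.cholesky(Kxx)
    Y = np.empty((N, D), dtype=int)
    P = np.empty((N, D, K))
    for d in range(D):
        # One GP draw per category: columns of F are function values at X.
        F = L @ rng.standard_normal((N, K))
        P[:, d, :] = softmax(F)              # category probabilities
        for n in range(N):
            Y[n, d] = rng.choice(K, p=P[n, d])
    return X, Y, P

X, Y, P = sample_discrete_gplvm()
```

Nearby latent points receive similar GP function values and therefore similar category probabilities, which is what lets a low-dimensional latent space summarize many categorical variables at once.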

Abstract

The coronavirus disease (COVID-19) has rapidly spread throughout the world, and while pregnant women present the same adverse outcome rates, they are underrepresented in clinical research. We collected clinical data from 155 pregnant women who tested positive for COVID-19 at Stony Brook University Hospital. Many of the collected data are multivariate categorical, where the number of possible outcomes grows exponentially as the dimension of the data increases. We modeled the data within an unsupervised Bayesian framework and mapped them into a lower-dimensional space using latent Gaussian processes. The latent features in the lower-dimensional space were then used to predict whether a pregnant woman would be admitted to a hospital due to COVID-19 or would remain with mild symptoms. We compared the prediction accuracy with that of dummy/one-hot encoding of the categorical data and found that the latent Gaussian process had better accuracy.
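As a small illustration of the one-hot baseline and of the exponential growth mentioned above, the sketch below encodes a toy categorical matrix and counts the possible joint outcomes. The cardinalities and the two example rows are hypothetical and are not taken from the clinical data; `one_hot` is a helper written for this example.

```python
import numpy as np

def one_hot(Y, cardinalities):
    """Concatenate one indicator block per categorical variable."""
    N, D = Y.shape
    # Starting column of each variable's block in the encoded matrix.
    offsets = np.concatenate(([0], np.cumsum(cardinalities)[:-1]))
    out = np.zeros((N, int(np.sum(cardinalities))), dtype=int)
    out[np.arange(N)[:, None], Y + offsets] = 1
    return out

cardinalities = [3, 2, 4]              # hypothetical categories per variable
Y = np.array([[0, 1, 3],
              [2, 0, 1]])              # two hypothetical patients
H = one_hot(Y, cardinalities)          # width is the sum of cardinalities: 9
n_joint = int(np.prod(cardinalities))  # joint outcomes are the product: 24
```

The encoded width grows with the sum of the cardinalities, while the number of distinct joint outcomes grows with their product, so a cohort of 155 patients can cover only a small fraction of the possible configurations once many variables are involved.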

Paper Structure

This paper contains 9 sections, 17 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Latent weights and inducing variables.
  • Figure 2: A graphical representation of the generative model.
  • Figure 3: Model pipeline.
  • Figure 4: Synthetic example.
  • Figure 5: Visualization of the patients. Blue circles represent asymptomatic patients or patients with mild symptoms, red circles represent patients who were hospitalized, and red crosses represent patients who were admitted to the ICU.